[00:19:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T298565)', diff saved to https://phabricator.wikimedia.org/P24475 and previous config saved to /var/cache/conftool/dbconfig/20220412-001923-ladsgroup.json [00:19:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:19:28] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [00:34:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P24476 and previous config saved to /var/cache/conftool/dbconfig/20220412-003428-ladsgroup.json [00:34:30] RECOVERY - SSH on labweb1002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [00:34:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:43:11] (03PS1) 10RLazarus: sretest: Uninstall external_clouds_vendors [puppet] - 10https://gerrit.wikimedia.org/r/779145 (https://phabricator.wikimedia.org/T270391) [00:43:13] (03PS1) 10RLazarus: sretest: Remove absented external_clouds_vendors [puppet] - 10https://gerrit.wikimedia.org/r/779146 (https://phabricator.wikimedia.org/T270391) [00:45:02] (03CR) 10RLazarus: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34785/console" [puppet] - 10https://gerrit.wikimedia.org/r/779145 (https://phabricator.wikimedia.org/T270391) (owner: 10RLazarus) [00:49:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P24477 and previous config saved to /var/cache/conftool/dbconfig/20220412-004933-ladsgroup.json 
[00:49:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:58:08] (03PS1) 10RLazarus: external_clouds_vendors: Remove migration shim for T305581 [puppet] - 10https://gerrit.wikimedia.org/r/779149 (https://phabricator.wikimedia.org/T305581) [00:59:16] (BlazegraphJvmQuakeWarnGC) firing: (2) Blazegraph instance wdqs1004:9100 is entering a GC death spiral - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphJvmQuakeWarnGC [01:00:04] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220412T0100) [01:04:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T298565)', diff saved to https://phabricator.wikimedia.org/P24478 and previous config saved to /var/cache/conftool/dbconfig/20220412-010438-ladsgroup.json [01:04:40] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance [01:04:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:04:41] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance [01:04:43] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [01:04:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:04:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:04:48] PROBLEM - SSH on aqs1009.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:06:39] (03CR) 10RLazarus: "No longer needed on puppetmaster frontends:" [puppet] - 10https://gerrit.wikimedia.org/r/779149 (https://phabricator.wikimedia.org/T305581) (owner: 10RLazarus) [01:21:35] (03PS1) 10RLazarus: external_cloud_vendors: Add a known-clients/Googlebot ipblock [puppet] - 10https://gerrit.wikimedia.org/r/779157 (https://phabricator.wikimedia.org/T305581) [01:38:45] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:43:45] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:54:13] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1133.eqiad.wmnet with reason: Maintenance [01:54:15] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1133.eqiad.wmnet with reason: Maintenance [01:54:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:54:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:55:49] (03CR) 10RLazarus: "Please use this patch to bikeshed over the name `known-clients` for this type of ipblock." 
[puppet] - 10https://gerrit.wikimedia.org/r/779157 (https://phabricator.wikimedia.org/T305581) (owner: 10RLazarus) [02:01:46] PROBLEM - Host ms-fe1012 is DOWN: PING CRITICAL - Packet loss = 100% [02:03:08] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [02:03:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:03:11] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [02:03:12] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [02:03:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:03:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:03:16] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [02:03:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:06:04] RECOVERY - SSH on aqs1009.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:07:20] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.39.0-wmf.7 [core] (wmf/1.39.0-wmf.7) - 10https://gerrit.wikimedia.org/r/779164 [02:07:26] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/1.39.0-wmf.7 [core] (wmf/1.39.0-wmf.7) - 10https://gerrit.wikimedia.org/r/779164 (owner: 10TrainBranchBot) [02:17:39] (NodeTextfileStale) firing: Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [02:24:26] (03Merged) 10jenkins-bot: Branch commit for wmf/1.39.0-wmf.7 [core] (wmf/1.39.0-wmf.7) - 10https://gerrit.wikimedia.org/r/779164 (owner: 10TrainBranchBot) [02:27:50] PROBLEM - MariaDB Replica Lag: s3 on db2139 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1314.62 
seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [02:28:31] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [02:28:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:28:34] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [02:28:35] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [02:28:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:28:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:28:39] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [02:28:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:32:39] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [02:33:22] (03PS1) 10Ladsgroup: labs: Set actor migration to read new [mediawiki-config] - 10https://gerrit.wikimedia.org/r/779166 (https://phabricator.wikimedia.org/T275246) [02:35:54] (03CR) 10Ladsgroup: [C: 03+2] labs: Set actor migration to read new [mediawiki-config] - 10https://gerrit.wikimedia.org/r/779166 (https://phabricator.wikimedia.org/T275246) (owner: 10Ladsgroup) [02:36:32] (03Merged) 10jenkins-bot: labs: Set actor migration to read new [mediawiki-config] - 10https://gerrit.wikimedia.org/r/779166 (https://phabricator.wikimedia.org/T275246) (owner: 10Ladsgroup) [02:41:55] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1132.eqiad.wmnet with reason: Maintenance [02:41:56] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1132.eqiad.wmnet with reason: Maintenance 
[02:41:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:41:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:43:52] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [02:43:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:43:55] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [02:43:57] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [02:43:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:43:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:44:00] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [02:44:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:46:10] PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test [02:55:38] PROBLEM - MariaDB Replica Lag: s3 on db1145 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1401.71 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [02:57:34] PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test [03:08:13] 10SRE, 10DBA, 10observability, 10MW-1.38-notes (1.38.0-wmf.19; 2022-01-24), and 2 others: Send metrics of db errors of mediawiki to prometheus - https://phabricator.wikimedia.org/T297435 (10Ladsgroup) a:05Ladsgroup→03None Won't be able to do it soon. 
[03:09:07] PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test [03:20:38] PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test [03:32:17] PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test [03:36:26] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance [03:36:28] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance [03:36:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:36:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:36:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3311 (T298565)', diff saved to https://phabricator.wikimedia.org/P24479 and previous config saved to /var/cache/conftool/dbconfig/20220412-033633-ladsgroup.json [03:36:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:36:37] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [03:38:30] PROBLEM - SSH on labweb1002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:43:48] PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test [03:43:54] 10SRE, 10SRE-Access-Requests: Requesting access to Analytic Cluster for Research Intern (paramita_das) - https://phabricator.wikimedia.org/T305298 (10paramita_das) Hi... Yes, I have set up access to the cluster. Thanks a lot! [03:46:08] RECOVERY - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is OK: 2 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test [03:46:38] PROBLEM - Host an-worker1136 is DOWN: PING CRITICAL - Packet loss = 100% [03:46:38] RECOVERY - Host an-worker1136 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [04:02:20] RECOVERY - MariaDB Replica Lag: s3 on db2139 is OK: OK slave_sql_lag Replication lag: 0.20 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [04:12:28] RECOVERY - MariaDB Replica Lag: s3 on db1145 is OK: OK slave_sql_lag Replication lag: 0.47 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [04:31:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311 (T298565)', diff saved to https://phabricator.wikimedia.org/P24480 and previous config saved to /var/cache/conftool/dbconfig/20220412-043118-ladsgroup.json [04:31:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:31:25] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf 
wikis - https://phabricator.wikimedia.org/T298565 [04:38:59] RECOVERY - SSH on labweb1002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [04:46:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311', diff saved to https://phabricator.wikimedia.org/P24481 and previous config saved to /var/cache/conftool/dbconfig/20220412-044623-ladsgroup.json [04:46:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:49:40] (03PS2) 10Marostegui: mariadb: Promote db1160 to s4 master [puppet] - 10https://gerrit.wikimedia.org/r/778688 (https://phabricator.wikimedia.org/T304933) [04:49:54] !log root@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on 31 hosts with reason: Primary switchover s4 T304933 [04:49:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:49:58] T304933: Switchover s4 master (db1138 -> db1160) - https://phabricator.wikimedia.org/T304933 [04:50:14] !log root@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 31 hosts with reason: Primary switchover s4 T304933 [04:50:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:50:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set db1160 with weight 0 T304933', diff saved to https://phabricator.wikimedia.org/P24482 and previous config saved to /var/cache/conftool/dbconfig/20220412-045023-root.json [04:50:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:59:16] (BlazegraphJvmQuakeWarnGC) firing: (2) Blazegraph instance wdqs1004:9100 is entering a GC death spiral - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphJvmQuakeWarnGC [05:01:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after 
maintenance db1105:3311', diff saved to https://phabricator.wikimedia.org/P24483 and previous config saved to /var/cache/conftool/dbconfig/20220412-050128-ladsgroup.json [05:01:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:09:13] PROBLEM - SSH on aqs1009.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:16:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311 (T298565)', diff saved to https://phabricator.wikimedia.org/P24484 and previous config saved to /var/cache/conftool/dbconfig/20220412-051633-ladsgroup.json [05:16:35] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance [05:16:37] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance [05:16:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:16:38] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [05:16:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:16:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:18:20] (03CR) 10Marostegui: [C: 03+2] mariadb: Promote db1160 to s4 master [puppet] - 10https://gerrit.wikimedia.org/r/778688 (https://phabricator.wikimedia.org/T304933) (owner: 10Marostegui) [05:25:15] (03PS1) 10Marostegui: db1138: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/779331 (https://phabricator.wikimedia.org/T302950) [05:25:41] (03PS1) 10Marostegui: Revert "mariadb: Disable notifications on db hosts for B1" [puppet] - 
10https://gerrit.wikimedia.org/r/779108 [05:25:50] (03CR) 10Marostegui: [C: 03+2] db1138: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/779331 (https://phabricator.wikimedia.org/T302950) (owner: 10Marostegui) [05:26:33] (03CR) 10Marostegui: [C: 03+2] Revert "mariadb: Disable notifications on db hosts for B1" [puppet] - 10https://gerrit.wikimedia.org/r/779108 (owner: 10Marostegui) [06:00:04] kormat, marostegui, and Amir1: Your horoscope predicts another unfortunate Primary database switchover deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220412T0600). [06:00:10] o/ [06:00:12] o/ [06:00:28] Amir1: Can you test an upload once I am done? [06:00:33] I will test a write on my talk page [06:00:37] sure [06:00:43] cool, going for it then [06:00:47] !log Starting s4 eqiad failover from db1138 to db1160 - T304933 [06:00:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:00:51] T304933: Switchover s4 master (db1138 -> db1160) - https://phabricator.wikimedia.org/T304933 [06:00:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set s4 eqiad as read-only for maintenance - T304933', diff saved to https://phabricator.wikimedia.org/P24485 and previous config saved to /var/cache/conftool/dbconfig/20220412-060057-root.json [06:01:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:01:21] RO confirmed [06:01:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote db1160 to s4 primary and set section read-write T304933', diff saved to https://phabricator.wikimedia.org/P24486 and previous config saved to /var/cache/conftool/dbconfig/20220412-060125-root.json [06:01:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:01:29] all done [06:01:41] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook 
[06:01:41] I can write [06:01:44] And back to RW [06:02:23] https://test-commons.wikimedia.org/wiki/File:282px-Rock_hyrax_(Procavia_capensis)_2.jpg [06:02:41] hmm, the media backend seems to be failing but that can be anything with test commons [06:02:49] let me see if new files in commons work [06:02:51] yeah, I was going to say [06:03:34] I can see files on recentchanges [06:03:47] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [06:03:49] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [06:03:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:03:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:04:11] (03CR) 10Marostegui: [C: 03+2] wmnet: Update s4 CNAME [dns] - 10https://gerrit.wikimedia.org/r/778689 (https://phabricator.wikimedia.org/T304933) (owner: 10Marostegui) [06:04:12] I see new files showing up [06:04:56] https://commons.wikimedia.org/wiki/File:WMF_desktop_background_green_-_TEST.png [06:05:05] excellent! 
[06:05:14] commons works, now I need to find an admin in commons to delete it [06:05:26] xddd [06:05:31] and also we should fix the file backend for test commons [06:05:46] So we are done, read-only time was 28 seconds [06:06:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1138 T304933', diff saved to https://phabricator.wikimedia.org/P24487 and previous config saved to /var/cache/conftool/dbconfig/20220412-060628-root.json [06:06:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:06:33] T304933: Switchover s4 master (db1138 -> db1160) - https://phabricator.wikimedia.org/T304933 [06:07:22] https://upload.wikimedia.org/wikipedia/test-commons/6/66/Screenshot_2021-11-30_024926.png [06:07:36] Unauthorized [06:07:36] This server could not verify that you are authorized to access the document you requested. [06:08:00] anyway [06:08:24] Amir1: but that is test-commons specific, right? [06:08:41] yeah [06:09:00] I think we have some bad rules in varnish or swift, I'll check later [06:09:23] marostegui: the burning question, who wants to do the cleanup of db1138? [06:09:33] I will take care of that [06:09:36] it has bullseye upgrade + templatelinks and a couple more [06:09:54] awesome [06:16:05] marostegui: have you planned anything for next week? 
[06:16:43] not yet [06:32:54] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [06:43:03] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db1138.eqiad.wmnet with OS bullseye [06:43:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:47:35] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:48:57] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:49:37] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 47967 bytes in 6.284 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:50:55] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.272 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:50:56] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1163.eqiad.wmnet with reason: Maintenance [06:50:57] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1163.eqiad.wmnet with reason: Maintenance [06:50:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:51:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:51:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1163 (T298565)', diff saved to https://phabricator.wikimedia.org/P24489 and previous config saved to /var/cache/conftool/dbconfig/20220412-065102-ladsgroup.json [06:51:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:51:06] 
T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [06:51:09] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1138.eqiad.wmnet with reason: host reimage [06:51:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:51:45] (JobUnavailable) firing: Reduced availability for job mysql-test in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:55:12] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1138.eqiad.wmnet with reason: host reimage [06:55:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:55:40] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:59:40] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 47965 bytes in 0.066 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:00:04] Amir1, awight, Urbanecm, and taavi: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220412T0700). [07:00:04] No Gerrit patches in the queue for this window AFAICS. 
[07:04:17] !log dbmaint s4@eqiad T300381 [07:04:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:04:22] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [07:05:32] PROBLEM - Check systemd state on ml-staging-ctrl2002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:08:20] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1138.eqiad.wmnet with OS bullseye [07:08:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:10:18] RECOVERY - SSH on aqs1009.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:11:02] RECOVERY - Check systemd state on ml-staging-ctrl2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:14:01] (BlazegraphJvmQuakeWarnGC) firing: (2) Blazegraph instance wdqs1004:9100 is entering a GC death spiral - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphJvmQuakeWarnGC [07:16:16] PROBLEM - Check systemd state on ml-staging-ctrl2002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:19:01] (BlazegraphJvmQuakeWarnGC) resolved: (2) Blazegraph instance wdqs1004:9100 is entering a GC death spiral - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphJvmQuakeWarnGC [07:19:25] 10SRE, 10SRE-Access-Requests, 10SRE-OnFire, 10WMF-Legal: Grant Zabe 
access to the T302047 gdoc incident report - https://phabricator.wikimedia.org/T302163 (10jcrespo) a:03KFrancis [07:20:50] (03PS1) 10Kevin Bazira: ml-services: add ruwiki, sqwiki & srwiki editquality isvcs [deployment-charts] - 10https://gerrit.wikimedia.org/r/779438 (https://phabricator.wikimedia.org/T301415) [07:28:43] PROBLEM - Check systemd state on ml-staging-ctrl2001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:31:25] RECOVERY - Check systemd state on ml-staging-ctrl2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:44:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163 (T298565)', diff saved to https://phabricator.wikimedia.org/P24490 and previous config saved to /var/cache/conftool/dbconfig/20220412-074433-ladsgroup.json [07:44:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:44:38] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [07:45:22] (03PS1) 10DCausse: wdqs: activate jvmquake at 300:5 [puppet] - 10https://gerrit.wikimedia.org/r/779440 (https://phabricator.wikimedia.org/T293862) [07:46:43] RECOVERY - Check systemd state on ml-staging-ctrl2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:47:16] (03PS2) 10DCausse: wdqs: activate jvmquake at 300:5 [puppet] - 10https://gerrit.wikimedia.org/r/779440 (https://phabricator.wikimedia.org/T293862) [07:57:54] !log gmodena@deploy1002 Started deploy [airflow-dags/research@b029f10]: (no justification provided) [07:57:56] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:59:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163', diff saved to https://phabricator.wikimedia.org/P24491 and previous config saved to /var/cache/conftool/dbconfig/20220412-075938-ladsgroup.json [07:59:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:59:51] PROBLEM - Check systemd state on ml-staging-ctrl2002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:03:15] !log gmodena@deploy1002 Finished deploy [airflow-dags/research@b029f10]: (no justification provided) (duration: 05m 20s) [08:03:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:06:39] RECOVERY - Check systemd state on ml-staging-ctrl2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:08:10] (03PS2) 10JMeybohm: Add all members of the ops group to the deployment group [puppet] - 10https://gerrit.wikimedia.org/r/779047 (https://phabricator.wikimedia.org/T305729) [08:08:12] (03PS2) 10JMeybohm: Switch default group for Kubernetes credentials files to deployment [puppet] - 10https://gerrit.wikimedia.org/r/779048 (https://phabricator.wikimedia.org/T305729) [08:08:49] PROBLEM - Check systemd state on ml-staging-ctrl2001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:08:59] (03PS3) 10JMeybohm: Add all members of the ops group to the deployment group [puppet] - 10https://gerrit.wikimedia.org/r/779047 (https://phabricator.wikimedia.org/T305729) [08:09:01] (03PS3) 10JMeybohm: Switch default group for Kubernetes credentials files to deployment [puppet] - 10https://gerrit.wikimedia.org/r/779048 
(https://phabricator.wikimedia.org/T305729) [08:09:16] (03CR) 10JMeybohm: Add all members of the ops group to the deployment group (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/779047 (https://phabricator.wikimedia.org/T305729) (owner: 10JMeybohm) [08:09:36] (03CR) 10JMeybohm: Switch default group for Kubernetes credentials files to deployment (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/779048 (https://phabricator.wikimedia.org/T305729) (owner: 10JMeybohm) [08:11:35] PROBLEM - Host analytics1076 is DOWN: PING CRITICAL - Packet loss = 100% [08:14:03] RECOVERY - Check systemd state on dumpsdata1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:14:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163', diff saved to https://phabricator.wikimedia.org/P24492 and previous config saved to /var/cache/conftool/dbconfig/20220412-081443-ladsgroup.json [08:14:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:15:13] PROBLEM - Widespread puppet agent failures- no resources reported on alert1001 is CRITICAL: 0.01003 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [08:17:53] RECOVERY - Check systemd state on ml-staging-ctrl2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:19:45] (03PS2) 10Jcrespo: admin: Add drochford to analytics-privatedata-users for superset [puppet] - 10https://gerrit.wikimedia.org/r/779024 (https://phabricator.wikimedia.org/T305634) [08:20:47] PROBLEM - Check systemd state on dumpsdata1002 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_rasdaemon.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:21:28] (03CR) 10Jcrespo: [C: 03+2] admin: Add drochford to 
analytics-privatedata-users for superset [puppet] - 10https://gerrit.wikimedia.org/r/779024 (https://phabricator.wikimedia.org/T305634) (owner: 10Jcrespo) [08:29:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163 (T298565)', diff saved to https://phabricator.wikimedia.org/P24493 and previous config saved to /var/cache/conftool/dbconfig/20220412-082948-ladsgroup.json [08:29:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:29:52] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [08:29:54] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1163.eqiad.wmnet with reason: Maintenance [08:29:55] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1163.eqiad.wmnet with reason: Maintenance [08:29:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:29:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:30:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1163 (T298565)', diff saved to https://phabricator.wikimedia.org/P24494 and previous config saved to /var/cache/conftool/dbconfig/20220412-083000-ladsgroup.json [08:30:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:30:39] (03CR) 10Vgutierrez: [C: 03+1] "+1 to the known-clients block. 
I'd also add https://www.gstatic.com/ipranges/goog.json cause other Google crawlers won't use the googlebot" [puppet] - 10https://gerrit.wikimedia.org/r/779157 (https://phabricator.wikimedia.org/T305581) (owner: 10RLazarus) [08:36:45] (JobUnavailable) resolved: Reduced availability for job mysql-test in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:38:29] PROBLEM - Check systemd state on ml-staging-ctrl2002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:49:41] PROBLEM - Check systemd state on ml-staging-ctrl2001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:51:46] !log dbmaint s4@eqiad T298556 [08:51:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:51:50] T298556: Fix mismatching field type of oldimage.oi_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298556 [08:51:57] !log dbmaint s4@eqiad T298294 [08:51:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:00] T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294 [08:59:07] (03CR) 10Gehel: [C: 03+2] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/776999 (https://phabricator.wikimedia.org/T304570) (owner: 10Bking) [09:01:38] !log dbmaint s7@eqiad T297189 [09:01:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:01:41] T297189: Schema change for dropping ft_title and ft_namespace - https://phabricator.wikimedia.org/T297189 [09:06:48] (03Merged) 10jenkins-bot: elastic: don't wait for green on first node [software/spicerack] - 
10https://gerrit.wikimedia.org/r/776999 (https://phabricator.wikimedia.org/T304570) (owner: 10Bking) [09:09:07] RECOVERY - Check systemd state on ml-staging-ctrl2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:11:29] !log dbmaint s4@eqiad T298557 [09:11:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:35] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [09:15:47] PROBLEM - Check systemd state on ml-staging-ctrl2002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:18:39] PROBLEM - SSH on wtp1048.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:18:43] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/779149 (https://phabricator.wikimedia.org/T305581) (owner: 10RLazarus) [09:20:13] RECOVERY - Check systemd state on ml-staging-ctrl2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:22:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163 (T298565)', diff saved to https://phabricator.wikimedia.org/P24495 and previous config saved to /var/cache/conftool/dbconfig/20220412-092204-ladsgroup.json [09:22:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:22:10] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [09:26:59] PROBLEM - Check systemd state on 
ml-staging-ctrl2001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:27:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2071 db2072', diff saved to https://phabricator.wikimedia.org/P24497 and previous config saved to /var/cache/conftool/dbconfig/20220412-092730-root.json [09:27:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool all codfw hosts that went down for on-site maintenance', diff saved to https://phabricator.wikimedia.org/P24498 and previous config saved to /var/cache/conftool/dbconfig/20220412-092846-root.json [09:28:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:57] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/779157 (https://phabricator.wikimedia.org/T305581) (owner: 10RLazarus) [09:30:39] 10SRE, 10Wikimedia-Mailing-lists: Mailman3: 550-Support for list subscription via email has been disabled. - https://phabricator.wikimedia.org/T303888 (10jcrespo) p:05Medium→03Lowest From my understanding of the ticket: * Given the disabling of mail subscription is done on purpose * The advertised mail co... [09:34:22] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for drochford (superset access with no server access) - https://phabricator.wikimedia.org/T305634 (10jcrespo) Access has been deployed, @drochford can you test access? 
[09:35:05] (03CR) 10Vgutierrez: [C: 03+1] sretest: Uninstall external_clouds_vendors [puppet] - 10https://gerrit.wikimedia.org/r/779145 (https://phabricator.wikimedia.org/T270391) (owner: 10RLazarus) [09:35:37] (03CR) 10Vgutierrez: [C: 03+1] sretest: Remove absented external_clouds_vendors [puppet] - 10https://gerrit.wikimedia.org/r/779146 (https://phabricator.wikimedia.org/T270391) (owner: 10RLazarus) [09:36:11] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/779145 (https://phabricator.wikimedia.org/T270391) (owner: 10RLazarus) [09:36:25] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/779146 (https://phabricator.wikimedia.org/T270391) (owner: 10RLazarus) [09:37:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163', diff saved to https://phabricator.wikimedia.org/P24499 and previous config saved to /var/cache/conftool/dbconfig/20220412-093709-ladsgroup.json [09:37:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:03] RECOVERY - Check systemd state on ml-staging-ctrl2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:43:42] (03PS6) 10Volans: service::catalog: add Spicerack comments [puppet] - 10https://gerrit.wikimedia.org/r/778332 [09:43:44] (03PS1) 10Volans: cloud vendors: force yaml output format [puppet] - 10https://gerrit.wikimedia.org/r/779444 (https://phabricator.wikimedia.org/T305581) [09:44:14] !log running logrotate /etc/logrotate.d/rsyslog --force on ml-staging-ctrl2001 (no space left on device) [09:44:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:44:35] klausman: ^^ - tons of kube api errors filling up the disk there [09:44:53] I'll take a look [09:45:17] logrotate is not yet finished, so logs might be moving around [09:45:37] let me know if you can use some help! 
[09:45:58] The problem is likely that the setup of the cluster is not yet complete, so it logs errors about not being able to talk to some components [09:46:10] (03PS1) 10David Caro: wmcs: Remove unused role wmcs::nfs::secondary [puppet] - 10https://gerrit.wikimedia.org/r/779446 (https://phabricator.wikimedia.org/T291405) [09:46:37] I see. Maybe stopping API server would be an option? [09:48:21] yep, that's the plan [09:49:15] the "interesting" part will be what then blows up next [09:49:56] (03CR) 10Vgutierrez: [C: 04-1] "looking good, please address the comment" [puppet] - 10https://gerrit.wikimedia.org/r/778492 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [09:50:35] The other great thing is that these messages all end up in the systemd journal *and* `syslog` *and* `daemon.log` [09:50:39] So triple-whammy [09:51:11] (03PS3) 10Zabe: sslcert: migrate update-ocsp-all cron to systemd timer job [puppet] - 10https://gerrit.wikimedia.org/r/778492 (https://phabricator.wikimedia.org/T273673) [09:51:17] (03CR) 10Zabe: sslcert: migrate update-ocsp-all cron to systemd timer job (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/778492 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [09:52:03] RECOVERY - Check systemd state on ml-staging-ctrl2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:52:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163', diff saved to https://phabricator.wikimedia.org/P24500 and previous config saved to /var/cache/conftool/dbconfig/20220412-095214-ladsgroup.json [09:52:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:39] I've stopped the apiservers [09:53:41] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - ml-staging-ctrl_6443: Servers ml-staging-ctrl2002.codfw.wmnet are marked down but pooled 
https://wikitech.wikimedia.org/wiki/PyBal [09:53:53] and there we go. [09:53:56] :) [09:54:10] logrotate is done btw [09:54:16] thanks! [09:54:39] I'm nominally out sick (ish), if you hadn't pinged me I would've missed this entirely [09:54:53] Luca is on PTO so we're 0.3/0 on SRE atm [09:55:27] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - ml-staging-ctrl_6443: Servers ml-staging-ctrl2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [09:55:34] Oh, sorry about that! I pinged you because I knew L.uca is out - wouldn't have done so if I'd known you're out sick [09:55:59] RECOVERY - Disk space on ml-staging-ctrl2002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ml-staging-ctrl2002&var-datasource=codfw+prometheus/ops [09:56:02] It's ok. I have COVID but it's on the way out, I'm just still under the weather as they say [09:57:41] I've depooled both servers. [09:58:21] ack. Feel free to ping me if anything comes up that needs to be taken care of! [09:58:26] also, it's a bit odd that Icinga warns about NTP being disabled. Isn't that normal on VMs? 
[09:58:55] PROBLEM - Check systemd state on ml-staging-ctrl2001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:59:12] here it comes :) - I'll run logrotate there as well [09:59:55] ty [10:00:45] PROBLEM - SSH on wtp1046.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:01:13] RECOVERY - Check systemd state on ml-staging-ctrl2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:01:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2130', diff saved to https://phabricator.wikimedia.org/P24501 and previous config saved to /var/cache/conftool/dbconfig/20220412-100147-root.json [10:01:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:06] !log running logrotate /etc/logrotate.d/rsyslog --force on ml-staging-ctrl2001 (no space left on device) [10:02:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:45] PROBLEM - PyBal IPVS diff check on lvs2009 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.72:6443]) https://wikitech.wikimedia.org/wiki/PyBal [10:02:45] PROBLEM - PyBal IPVS diff check on lvs2010 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.72:6443]) https://wikitech.wikimedia.org/wiki/PyBal [10:03:07] urgh, just more yaks [10:05:44] What is the right way to deal with this? apiserver is off so it doesn't fill the disk, but that means I have to depool the machine so LVS isn't unhappy, and now pybal is unhappy because IPVS doesn't know the service. [10:06:06] (not that depooling _actually_ helped...) [10:06:22] how is it set in confctl? pooled=no or pooled=inactive? 
[10:06:38] I just used "sudo depool" [10:06:40] the first keeps the endpoint known to LVS, the latter removes it completely [10:06:49] that's pooled=no [10:06:49] RECOVERY - Disk space on ml-staging-ctrl2001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ml-staging-ctrl2001&var-datasource=codfw+prometheus/ops [10:07:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163 (T298565)', diff saved to https://phabricator.wikimedia.org/P24502 and previous config saved to /var/cache/conftool/dbconfig/20220412-100719-ladsgroup.json [10:07:21] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [10:07:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:07:23] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [10:07:23] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [10:07:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:07:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:08] Mh, the "not known to" bit might actually be missing config [10:08:46] I got right up to config'ing pybal, but it was a Fri afternoon, so I delayed to the Monday following and then Other Stuff happened [10:09:13] Mhno, the alert has only been active for 10m [10:10:23] If my brain actually worked, that would help :-S [10:11:17] So.. 
you stated before that you've depooled both servers [10:11:36] That goes against the depool threshold so pybal refused to depool the second server [10:11:46] That's all expected [10:12:01] ah. [10:13:33] Am I supposed to force it, then? Or what is the ultimately right course of action? (aside from setting up the whole thing properly, which I can't be trusted with, atm) [10:14:02] what are you trying to achieve? [10:14:21] Other people not seeing spurious alerts because of our half-set-up cluster [10:14:44] (and alerts not masking real problems) [10:18:50] right, the main problem here is that pybal/lvs doesn't support a service without servers [10:19:21] so if that's the current state of affairs you should remove the service from the load balancers [10:19:49] RECOVERY - SSH on wtp1048.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:21:29] I'll try and make a patch [10:23:01] (03PS1) 10Klausman: Remove LVS setup for ml-staging-ctrl [puppet] - 10https://gerrit.wikimedia.org/r/779449 [10:24:08] (03CR) 10Klausman: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34787/console" [puppet] - 10https://gerrit.wikimedia.org/r/779449 (owner: 10Klausman) [10:24:59] (03CR) 10Vgutierrez: [C: 04-1] "One more thing that I didn't spotted till I ran PCC against a cp server" [puppet] - 10https://gerrit.wikimedia.org/r/778492 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [10:25:15] (03CR) 10Klausman: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34788/console" [puppet] - 10https://gerrit.wikimedia.org/r/779449 (owner: 10Klausman) [10:26:16] klausman: hmm why remove the service definition entirely instead of rollbacking the service state to service_setup? [10:26:31] My brain no worky? 
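The depool-threshold behaviour described at 10:11:36 (the second of two servers is refused) can be illustrated with a small model. This is a simplified sketch and not PyBal's actual code: the function names, the comparison rule, and the threshold value of 0.5 are all assumptions chosen so that a two-server pool behaves the way the channel describes.

```python
from typing import List, Tuple


def can_depool(pooled: int, total: int, threshold: float) -> bool:
    """Hypothetical rule: allow a depool only if the fraction of servers
    still pooled afterwards stays at or above the threshold."""
    if total <= 0 or pooled <= 0:
        return False
    return (pooled - 1) / total >= threshold


def simulate_depool(servers: List[str], threshold: float) -> Tuple[List[str], List[str]]:
    """Try to depool every server in order.

    Returns (depooled, still_pooled). Servers whose removal would drop the
    pool below the threshold stay pooled even though they are down.
    """
    total = len(servers)
    pooled = list(servers)
    depooled = []
    for name in servers:
        if can_depool(len(pooled), total, threshold):
            pooled.remove(name)
            depooled.append(name)
    return depooled, pooled


if __name__ == "__main__":
    # Assumed threshold of 0.5; the real configured value is not in the log.
    done, kept = simulate_depool(
        ["ml-staging-ctrl2001", "ml-staging-ctrl2002"], threshold=0.5
    )
    print("depooled:", done)
    print("still pooled:", kept)
```

Under these assumptions one server is removed and the other stays pooled while down, which is the shape of the "marked down but pooled" alert seen earlier; a service with zero healthy backends therefore cannot be silenced by depooling alone.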
[10:27:20] (03PS2) 10Klausman: Remove LVS setup for ml-staging-ctrl [puppet] - 10https://gerrit.wikimedia.org/r/779449 [10:28:16] (03CR) 10Klausman: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34789/console" [puppet] - 10https://gerrit.wikimedia.org/r/779449 (owner: 10Klausman) [10:29:01] ok, that will require a manual step, basically removing the ipvs entry for the service after restarting pybal [10:30:22] I'm gonna need guidance on that [10:30:45] ipvsadm --delete-service --tcp-service 10.2.1.72:6443 on both lvs2009 and lvs2010 [10:31:24] So, once I get a review, merge, puppet-merge, (??? gotta look up pybal restart), ipvsadm [10:31:34] (03CR) 10Vgutierrez: [C: 03+1] Remove LVS setup for ml-staging-ctrl [puppet] - 10https://gerrit.wikimedia.org/r/779449 (owner: 10Klausman) [10:31:43] (03PS4) 10Zabe: sslcert: migrate update-ocsp-all cron to systemd timer job [puppet] - 10https://gerrit.wikimedia.org/r/778492 (https://phabricator.wikimedia.org/T273673) [10:32:13] (03CR) 10Klausman: [V: 03+1 C: 03+2] Remove LVS setup for ml-staging-ctrl [puppet] - 10https://gerrit.wikimedia.org/r/779449 (owner: 10Klausman) [10:32:14] klausman: 1. merge, 2. restart pybal on lvs2010. 3. Clean ipvs entry on lvs2010. 4. Check that lvs2010 is all green 5. Proceed with lvs2009 [10:32:39] Ok. Is restart just a systemd service? 
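The five steps listed at 10:32:14 amount to a rolling change, one load balancer at a time. The sketch below only builds and prints that checklist; it runs nothing. The `Step` class and `build_plan` function are hypothetical names introduced for illustration, the two command strings are the ones quoted in the channel (the restart command is confirmed at 10:33:21), and steps without a quoted command are left as plain descriptions rather than guessed.

```python
from dataclasses import dataclass
from typing import List, Optional

# Service address quoted in the channel at 10:30:45.
SERVICE = "10.2.1.72:6443"


@dataclass
class Step:
    host: Optional[str]            # None for steps not tied to one load balancer
    action: str
    command: Optional[str] = None  # set only where the log quotes a command


def build_plan(hosts: List[str]) -> List[Step]:
    """Return the ordered checklist: merge once, then finish each host
    completely before touching the next one."""
    plan = [Step(None, "merge the change, then puppet-merge")]
    for host in hosts:
        plan += [
            Step(host, "run puppet so the new config is in place"),
            Step(host, "log the restart on SAL"),
            Step(host, "restart pybal", "systemctl restart pybal"),
            Step(host, "clean the IPVS entry",
                 f"ipvsadm --delete-service --tcp-service {SERVICE}"),
            Step(host, "check that the host is all green before moving on"),
        ]
    return plan


if __name__ == "__main__":
    # Order used in the channel: lvs2010 first, then lvs2009.
    for number, step in enumerate(build_plan(["lvs2010", "lvs2009"]), start=1):
        where = step.host or "repo"
        suffix = f"  [{step.command}]" if step.command else ""
        print(f"{number:2d}. {where}: {step.action}{suffix}")
```

Finishing one host before starting the next means a bad config is caught while the other load balancer is still untouched. Later in the log, at 10:42:28, no IPVS entries for the staging service were found, so the cleanup step appears to have been a no-op in this case.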
[10:32:54] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [10:33:01] 10SRE: SRE team to please de-link my aodit@wikimedia.org staff email from personal volunteer profile - https://phabricator.wikimedia.org/T305919 (10Astuthiodit_1) [10:33:21] klausman: yes, systemctl restart pybal [10:33:28] of course log it on SAL :) [10:33:47] after running puppet on lvs2010 of course :) [10:33:51] aye [10:35:11] !log restarting pybal on lvs2010 for change 779449 [10:35:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:36:19] (03CR) 10Vgutierrez: [C: 04-1] sslcert: migrate update-ocsp-all cron to systemd timer job (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/778492 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [10:37:13] (03PS5) 10Zabe: sslcert: migrate update-ocsp-all cron to systemd timer job [puppet] - 10https://gerrit.wikimedia.org/r/778492 (https://phabricator.wikimedia.org/T273673) [10:37:41] (03CR) 10Zabe: sslcert: migrate update-ocsp-all cron to systemd timer job (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/778492 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [10:38:03] RECOVERY - PyBal IPVS diff check on lvs2010 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [10:38:09] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [10:38:19] yaaay [10:38:34] lvs2010 is all-green [10:38:40] proceeding with 2009 [10:38:46] ack [10:39:27] (03CR) 10Zabe: [V: 03+1] "PCC: https://puppet-compiler.wmflabs.org/pcc-worker1003/34791/cp3050.esams.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/778492 
(https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [10:39:40] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34792/console" [puppet] - 10https://gerrit.wikimedia.org/r/778492 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [10:40:25] RECOVERY - PyBal IPVS diff check on lvs2009 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [10:41:09] RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [10:41:25] klausman: hmm you forgot to log the pybal restart on lvs2009? :) [10:41:34] !log restarting pybal on lvs2009 for change 779449 [10:41:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:36] no :) [10:41:50] I *did* forget pressing return [10:42:17] 2009 is now also all-green [10:42:22] great [10:42:27] (03CR) 10Zabe: [C: 03+1] Add Wikistories extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773212 (https://phabricator.wikimedia.org/T303004) (owner: 10Sbisson) [10:42:28] also, I didn't find any IPVS entries for ml*staging* [10:42:49] So that was easier. [10:43:13] (03CR) 10Urbanecm: [C: 03+1] "since wmf.6 is up and won't rollback, this should be fine" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773212 (https://phabricator.wikimedia.org/T303004) (owner: 10Sbisson) [10:44:11] I think this should cover it. Thanks Valentin, Riccardo and Jayme for your help. I'll go lie down now. [10:44:22] take care klausman [10:44:31] get well! [10:48:58] (03CR) 10Urbanecm: [C: 04-1] "This change is ready for review." 
(032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773334 (https://phabricator.wikimedia.org/T303004) (owner: 10Sbisson) [10:54:29] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance [10:54:31] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance [10:54:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:24] (03CR) 10Daimona Eaytoy: [C: 03+1] Temporarily undeprecate EditPage::$textbox2 [core] (wmf/1.39.0-wmf.6) - 10https://gerrit.wikimedia.org/r/778641 (https://phabricator.wikimedia.org/T305028) (owner: 10Thiemo Kreuz (WMDE)) [10:56:07] RECOVERY - Host ms-fe1012 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [10:58:35] !log dbmaint s4@eqiad T300992 [10:58:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:38] (03CR) 10Urbanecm: [C: 04-1] Enable Wikistories on enwiki beta (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773334 (https://phabricator.wikimedia.org/T303004) (owner: 10Sbisson) [10:58:38] T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992 [11:00:04] Lucas_WMDE and hoo: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Maintenance script run deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220412T1100). 
[11:00:42] I’m calling off that maintenance script run, I noticed something in the script that I don’t think we want [11:01:01] I’ll file another deployment window for that later [11:01:08] anyone else is free to deploy now as far as I’m concerned [11:01:53] RECOVERY - SSH on wtp1046.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:06:41] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:07:29] 10SRE, 10Wikimedia-Mailing-lists: hyperkitty didn't import all wikitech-l messages - https://phabricator.wikimedia.org/T281070 (10jcrespo) I checked a few messages mentioned and they seem to have been correctly imported now, including those having emojis. Can someone double check so we can resolve this (or giv... [11:11:24] !log dbmaint s4@eqiad T298554 [11:11:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:28] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554 [11:23:19] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:36:58] !log dbmaint s4@eqiad T300775 [11:37:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:02] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775 [11:41:45] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance [11:41:47] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance [11:41:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:50] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3311 (T298565)', diff saved to https://phabricator.wikimedia.org/P24504 and previous config saved to /var/cache/conftool/dbconfig/20220412-114152-ladsgroup.json [11:41:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:55] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [11:47:49] (03PS1) 10Majavah: openstack: remove enc api from puppet masters [puppet] - 10https://gerrit.wikimedia.org/r/779460 (https://phabricator.wikimedia.org/T295247) [11:49:40] (03CR) 10jerkins-bot: [V: 04-1] openstack: remove enc api from puppet masters [puppet] - 10https://gerrit.wikimedia.org/r/779460 (https://phabricator.wikimedia.org/T295247) (owner: 10Majavah) [11:58:06] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34795/console" [puppet] - 10https://gerrit.wikimedia.org/r/779460 (https://phabricator.wikimedia.org/T295247) (owner: 10Majavah) [11:58:42] (03PS2) 10Majavah: openstack: remove enc api from puppet masters [puppet] - 10https://gerrit.wikimedia.org/r/779460 (https://phabricator.wikimedia.org/T295247) [11:59:12] (03CR) 10Zabe: [C: 04-1] Enable Wikistories on enwiki beta (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773334 (https://phabricator.wikimedia.org/T303004) (owner: 10Sbisson) [11:59:28] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34796/console" [puppet] - 10https://gerrit.wikimedia.org/r/779460 (https://phabricator.wikimedia.org/T295247) (owner: 10Majavah) [12:12:29] (03PS5) 10Arturo 
Borrero Gonzalez: prometheus-openstack-exporter: refresh profile for the new exporter [puppet] - 10https://gerrit.wikimedia.org/r/778504 (https://phabricator.wikimedia.org/T302178) [12:12:31] (03PS2) 10Arturo Borrero Gonzalez: aptrepo: introduce bullseye-wikimedia/component/prometheus-openstack-exporter [puppet] - 10https://gerrit.wikimedia.org/r/778488 (https://phabricator.wikimedia.org/T302178) [12:12:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db2071', diff saved to https://phabricator.wikimedia.org/P24505 and previous config saved to /var/cache/conftool/dbconfig/20220412-121254-root.json [12:12:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2088:3311', diff saved to https://phabricator.wikimedia.org/P24506 and previous config saved to /var/cache/conftool/dbconfig/20220412-121744-root.json [12:17:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:21:36] (03PS1) 10Lucas Werkmeister (WMDE): Don’t use session-consistent connections in UnexpectedUnconnectedPagePrimer [extensions/Wikibase] (wmf/1.39.0-wmf.6) - 10https://gerrit.wikimedia.org/r/779109 [12:21:48] (03PS1) 10Lucas Werkmeister (WMDE): Don’t use session-consistent connections in UnexpectedUnconnectedPagePrimer [extensions/Wikibase] (wmf/1.39.0-wmf.7) - 10https://gerrit.wikimedia.org/r/779110 [12:23:22] (03PS6) 10Jgiannelos: Remove unused wgKartographerDfltStyle after tegola roll out [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772428 (https://phabricator.wikimedia.org/T298249) [12:23:52] (just reiterating that we’re not using the currently ongoing “maintenance script run” window, so anyone else is free to deploy as far as I’m concerned :) [12:24:20] (03CR) 10David Caro: [C: 03+1] "Just nits, feel free to ignore" [puppet] - 10https://gerrit.wikimedia.org/r/778504 (https://phabricator.wikimedia.org/T302178) (owner: 10Arturo Borrero Gonzalez) [12:24:33] 
(03CR) 10David Caro: [C: 03+1] "also, pcc might be nice" [puppet] - 10https://gerrit.wikimedia.org/r/778504 (https://phabricator.wikimedia.org/T302178) (owner: 10Arturo Borrero Gonzalez) [12:25:08] (03CR) 10David Caro: "Should the stack be reversed? (so the package is available before puppet tries to pull it?)" [puppet] - 10https://gerrit.wikimedia.org/r/778488 (https://phabricator.wikimedia.org/T302178) (owner: 10Arturo Borrero Gonzalez) [12:38:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311 (T298565)', diff saved to https://phabricator.wikimedia.org/P24507 and previous config saved to /var/cache/conftool/dbconfig/20220412-123845-ladsgroup.json [12:38:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:50] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [12:41:40] (03PS6) 10Arturo Borrero Gonzalez: prometheus-openstack-exporter: refresh profile for the new exporter [puppet] - 10https://gerrit.wikimedia.org/r/778504 (https://phabricator.wikimedia.org/T302178) [12:41:42] (03PS3) 10Arturo Borrero Gonzalez: aptrepo: introduce bullseye-wikimedia/component/prometheus-openstack-exporter [puppet] - 10https://gerrit.wikimedia.org/r/778488 (https://phabricator.wikimedia.org/T302178) [12:44:05] (03PS7) 10Arturo Borrero Gonzalez: prometheus-openstack-exporter: refresh profile for the new exporter [puppet] - 10https://gerrit.wikimedia.org/r/778504 (https://phabricator.wikimedia.org/T302178) [12:44:07] (03PS4) 10Arturo Borrero Gonzalez: aptrepo: introduce bullseye-wikimedia/component/prometheus-openstack-exporter [puppet] - 10https://gerrit.wikimedia.org/r/778488 (https://phabricator.wikimedia.org/T302178) [12:46:46] (03PS8) 10Arturo Borrero Gonzalez: 
prometheus-openstack-exporter: refresh profile for the new exporter [puppet] - 10https://gerrit.wikimedia.org/r/778504 (https://phabricator.wikimedia.org/T302178) [12:46:48] (03PS5) 10Arturo Borrero Gonzalez: aptrepo: introduce bullseye-wikimedia/component/prometheus-openstack-exporter [puppet] - 10https://gerrit.wikimedia.org/r/778488 (https://phabricator.wikimedia.org/T302178) [12:48:45] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/778504 (https://phabricator.wikimedia.org/T302178) (owner: 10Arturo Borrero Gonzalez) [12:50:04] (03PS9) 10Arturo Borrero Gonzalez: prometheus-openstack-exporter: refresh profile for the new exporter [puppet] - 10https://gerrit.wikimedia.org/r/778504 (https://phabricator.wikimedia.org/T302178) [12:50:06] (03PS6) 10Arturo Borrero Gonzalez: aptrepo: introduce bullseye-wikimedia/component/prometheus-openstack-exporter [puppet] - 10https://gerrit.wikimedia.org/r/778488 (https://phabricator.wikimedia.org/T302178) [12:50:31] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/778504 (https://phabricator.wikimedia.org/T302178) (owner: 10Arturo Borrero Gonzalez) [12:51:38] !log modify loopback filter on cr3-ulsfo to add terms needed in evpn context T304553 [12:51:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:44] T304553: Unify loopback filters between CR routers and L3 switches - https://phabricator.wikimedia.org/T304553 [12:52:16] (03CR) 10Arturo Borrero Gonzalez: prometheus-openstack-exporter: refresh profile for the new exporter (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/778504 (https://phabricator.wikimedia.org/T302178) (owner: 10Arturo Borrero Gonzalez) [12:53:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311', diff saved to https://phabricator.wikimedia.org/P24508 and previous config saved to 
/var/cache/conftool/dbconfig/20220412-125350-ladsgroup.json [12:53:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:18] (03CR) 10Vivian Rook: [C: 03+1] P:toolforge::prometheus: remove paws jobs [puppet] - 10https://gerrit.wikimedia.org/r/778673 (https://phabricator.wikimedia.org/T304716) (owner: 10Majavah) [12:55:58] (03PS10) 10Arturo Borrero Gonzalez: prometheus-openstack-exporter: refresh profile for the new exporter [puppet] - 10https://gerrit.wikimedia.org/r/778504 (https://phabricator.wikimedia.org/T302178) [12:56:00] (03CR) 10Vivian Rook: [C: 03+2] P:toolforge::prometheus: remove paws jobs [puppet] - 10https://gerrit.wikimedia.org/r/778673 (https://phabricator.wikimedia.org/T304716) (owner: 10Majavah) [12:56:02] (03PS7) 10Arturo Borrero Gonzalez: aptrepo: introduce bullseye-wikimedia/component/prometheus-openstack-exporter [puppet] - 10https://gerrit.wikimedia.org/r/778488 (https://phabricator.wikimedia.org/T302178) [12:57:44] (03CR) 10Vivian Rook: [C: 03+2] "Oh this has parent reviews. I haven't seen this before. If I "submit including parents" will that pull 778622 into the merge?" [puppet] - 10https://gerrit.wikimedia.org/r/778673 (https://phabricator.wikimedia.org/T304716) (owner: 10Majavah) [13:00:04] RoanKattouw, Lucas_WMDE, and Urbanecm: Your horoscope predicts another unfortunate UTC afternoon backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220412T1300). [13:00:04] Lucas_WMDE and nemo-yiannis: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:08] o/ [13:00:14] o/ [13:00:29] if someone else could start deploying nemo-yiannis’ patch, that would be great, I’m not quite free yet [13:01:21] i can deploy today [13:01:33] Lucas_WMDE: should i +2 your backports to save time? 
[13:01:34] (03CR) 10David Caro: P:toolforge::prometheus: remove paws jobs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/778673 (https://phabricator.wikimedia.org/T304716) (owner: 10Majavah) [13:01:56] (03CR) 10Urbanecm: [C: 03+2] Remove unused wgKartographerDfltStyle after tegola roll out [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772428 (https://phabricator.wikimedia.org/T298249) (owner: 10Jgiannelos) [13:02:06] nemo-yiannis: hello! [13:02:14] hey [13:02:40] (03Merged) 10jenkins-bot: Remove unused wgKartographerDfltStyle after tegola roll out [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772428 (https://phabricator.wikimedia.org/T298249) (owner: 10Jgiannelos) [13:03:28] nemo-yiannis: i pulled the patch to mwdebug1001. can you have a look? [13:03:29] urbanecm: sure, thanks :) [13:03:44] (03CR) 10Urbanecm: [C: 03+2] Don’t use session-consistent connections in UnexpectedUnconnectedPagePrimer [extensions/Wikibase] (wmf/1.39.0-wmf.6) - 10https://gerrit.wikimedia.org/r/779109 (owner: 10Lucas Werkmeister (WMDE)) [13:03:49] (03CR) 10Urbanecm: [C: 03+2] Don’t use session-consistent connections in UnexpectedUnconnectedPagePrimer [extensions/Wikibase] (wmf/1.39.0-wmf.7) - 10https://gerrit.wikimedia.org/r/779110 (owner: 10Lucas Werkmeister (WMDE)) [13:03:54] the affected code doesn’t run on web requests anyways, just preparing for future maintenance script runs [13:03:55] done :) [13:03:58] thanks :) [13:05:28] (03CR) 10Arturo Borrero Gonzalez: [V: 03+1 C: 03+2] "PCC: https://puppet-compiler.wmflabs.org/pcc-worker1002/34802/" [puppet] - 10https://gerrit.wikimedia.org/r/778504 (https://phabricator.wikimedia.org/T302178) (owner: 10Arturo Borrero Gonzalez) [13:05:43] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:05:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:46] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:05:47] 
!log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:05:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:51] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:05:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:16] urbanecm: looks ok [13:07:21] syncing [13:07:23] thanks [13:07:54] I added one config change to the window, hope thats ok [13:07:59] zabe: absolutely [13:08:33] (03PS4) 10Urbanecm: Migrate $wmfConfigDir to $configDir [mediawiki-config] - 10https://gerrit.wikimedia.org/r/778667 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [13:08:33] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 572e62140340501849678a0425be29ed0b75fabb: Remove unused wgKartographerDfltStyle after tegola roll out (T298249) (duration: 00m 52s) [13:08:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:37] (03CR) 10Urbanecm: [C: 03+2] Migrate $wmfConfigDir to $configDir [mediawiki-config] - 10https://gerrit.wikimedia.org/r/778667 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [13:08:38] T298249: Cleanup kartographer default styles in mediawiki config - https://phabricator.wikimedia.org/T298249 [13:08:51] nemo-yiannis: should be live. anything else? 
[13:08:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311', diff saved to https://phabricator.wikimedia.org/P24509 and previous config saved to /var/cache/conftool/dbconfig/20220412-130855-ladsgroup.json [13:08:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:20] (03Merged) 10jenkins-bot: Migrate $wmfConfigDir to $configDir [mediawiki-config] - 10https://gerrit.wikimedia.org/r/778667 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [13:09:23] urbanecm: i think thats it, thanks! [13:09:28] no problem [13:09:41] urbanecm, I cant test mine on mwdebug [13:09:47] why not? [13:10:08] (is that used only in the diff tests as of today?) [13:10:23] (03CR) 10Vivian Rook: [C: 03+2] P:wmcs::paws::prometheus: add kubernetes prometheus jobs [puppet] - 10https://gerrit.wikimedia.org/r/778622 (https://phabricator.wikimedia.org/T304716) (owner: 10Majavah) [13:11:14] it is a script, when I understand it correctly [13:11:34] ok [13:12:13] well, running it works [13:12:14] syncing [13:13:30] !log urbanecm@deploy1002 Synchronized multiversion/buildConfigCache.php: 8b74b085704a75bd52d490fecfa8a8996f17ce89: Migrate $wmfConfigDir to $configDir (T45956) (duration: 00m 51s) [13:13:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:36] T45956: Rename $wmf* to $wmg* in wmf-config - https://phabricator.wikimedia.org/T45956 [13:13:39] zabe: it's live [13:13:41] anything else? 
[13:13:53] no, thanks :) [13:13:57] no problem [13:14:11] Lucas_WMDE: I think you can go ahead when free (and it merges) [13:15:26] alright, I’m back :) [13:15:41] welcome back :) [13:15:58] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:16:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:01] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:16:02] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:16:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:06] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:16:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:13] and yes, I can't see it being used on production (in codesearch) [13:21:00] (03Merged) 10jenkins-bot: Don’t use session-consistent connections in UnexpectedUnconnectedPagePrimer [extensions/Wikibase] (wmf/1.39.0-wmf.6) - 10https://gerrit.wikimedia.org/r/779109 (owner: 10Lucas Werkmeister (WMDE)) [13:21:03] (03Merged) 10jenkins-bot: Don’t use session-consistent connections in UnexpectedUnconnectedPagePrimer [extensions/Wikibase] (wmf/1.39.0-wmf.7) - 10https://gerrit.wikimedia.org/r/779110 (owner: 10Lucas Werkmeister (WMDE)) [13:21:05] yay [13:21:27] lots of things happening in the git fetch [13:21:49] but only one new commit on the branch, as expected [13:23:50] !log lucaswerkmeister-wmde@deploy1002 Synchronized php-1.39.0-wmf.6/extensions/Wikibase/client/includes/Store/Sql/UnexpectedUnconnectedPagePrimer.php: Backport: [[gerrit:779109|Don’t use session-consistent connections in UnexpectedUnconnectedPagePrimer]] (duration: 00m 57s) [13:23:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:01] !log ladsgroup@cumin1001 
dbctl commit (dc=all): 'Repooling after maintenance db1105:3311 (T298565)', diff saved to https://phabricator.wikimedia.org/P24510 and previous config saved to /var/cache/conftool/dbconfig/20220412-132400-ladsgroup.json [13:24:02] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1132.eqiad.wmnet with reason: Maintenance [13:24:04] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1132.eqiad.wmnet with reason: Maintenance [13:24:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:05] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [13:24:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:12] wmf.7 doesn’t exist yet, so I guess there’s nothing to sync there [13:24:21] train will just happen later today [13:25:59] !log UTC afternoon backport window done [13:26:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:12] (03PS2) 10Sbisson: Enable Wikistories on enwiki beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773334 (https://phabricator.wikimedia.org/T303004) [13:26:14] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:26:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:17] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:26:18] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:26:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:21] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:22] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:26:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:44] !log bking@cumin1001 START - Cookbook sre.hosts.remove-downtime for wcqs2001.codfw.wmnet [13:28:44] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for wcqs2001.codfw.wmnet [13:28:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:14] (03CR) 10Sbisson: Enable Wikistories on enwiki beta (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773334 (https://phabricator.wikimedia.org/T303004) (owner: 10Sbisson) [13:31:25] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:31:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:28] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:31:29] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:31:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:33] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:31:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:08] !log Adding loopback4 filter to lo0.0 interface ingress lsw1-e1-eqiad T304553 [13:38:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:12] T304553: Unify loopback filters between CR routers and L3 switches - https://phabricator.wikimedia.org/T304553 [13:38:58] 10SRE, 10Traffic, 10Patch-For-Review: Update certspotter - https://phabricator.wikimedia.org/T204993 (10ssingh) 
[13:39:07] 10SRE, 10Acme-chief, 10Traffic: Integrate certspotter with certcentral to avoid certspotter notifying us on legitimate certs generated by our certcentral boxes - https://phabricator.wikimedia.org/T204994 (10ssingh) [13:41:19] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/778488 (https://phabricator.wikimedia.org/T302178) (owner: 10Arturo Borrero Gonzalez) [13:42:55] (03CR) 10Urbanecm: [C: 03+1] Enable Wikistories on enwiki beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773334 (https://phabricator.wikimedia.org/T303004) (owner: 10Sbisson) [13:44:09] 10SRE, 10SRE-OnFire, 10observability, 10I18n: Internationalization (i18n) & localization (l10n) of www.wikimediastatus.net - https://phabricator.wikimedia.org/T305896 (10Reedy) [13:44:20] 10SRE-swift-storage: swift-ring deploys should rsync TARGETS to puppet volatile - https://phabricator.wikimedia.org/T293438 (10MatthewVernon) 05Open→03Resolved a:03MatthewVernon Closing this, since it relates to the old ring management infrastructure, obsoleted by T265117. 
[13:58:02] 10SRE-swift-storage: swift wmf/rewrite.py middleware broken on bullseye (and its test suite doesn't work either) - https://phabricator.wikimedia.org/T305942 (10MatthewVernon) [13:58:22] 10SRE-swift-storage: swift wmf/rewrite.py middleware broken on bullseye (and its test suite doesn't work either) - https://phabricator.wikimedia.org/T305942 (10MatthewVernon) [13:58:24] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10MatthewVernon) [13:58:42] 10SRE-swift-storage: swift wmf/rewrite.py middleware broken on bullseye (and its test suite doesn't work either) - https://phabricator.wikimedia.org/T305942 (10MatthewVernon) p:05Triage→03High [14:00:12] 10SRE-swift-storage: Swiftrepl doesn't work on bullseye (and swiftrepl.conf is deployed by hand) - https://phabricator.wikimedia.org/T299125 (10MatthewVernon) p:05Medium→03High [14:00:38] 10SRE-swift-storage: Swiftrepl doesn't work on bullseye (and swiftrepl.conf is deployed by hand) - https://phabricator.wikimedia.org/T299125 (10MatthewVernon) [14:00:40] (03PS1) 10Majavah: P:toolforge::prometheus: simplify prometheus config [puppet] - 10https://gerrit.wikimedia.org/r/779474 (https://phabricator.wikimedia.org/T304716) [14:00:41] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10MatthewVernon) [14:01:47] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34803/console" [puppet] - 10https://gerrit.wikimedia.org/r/779474 (https://phabricator.wikimedia.org/T304716) (owner: 10Majavah) [14:02:19] (03CR) 10jerkins-bot: [V: 04-1] P:toolforge::prometheus: simplify prometheus config [puppet] - 10https://gerrit.wikimedia.org/r/779474 (https://phabricator.wikimedia.org/T304716) (owner: 10Majavah) [14:02:53] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs 
options - https://phabricator.wikimedia.org/T279637 (10MatthewVernon) p:05Triage→03High [14:06:19] (03PS2) 10Majavah: P:toolforge::prometheus: simplify prometheus config [puppet] - 10https://gerrit.wikimedia.org/r/779474 (https://phabricator.wikimedia.org/T304716) [14:10:31] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:17:52] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1133.eqiad.wmnet with reason: Maintenance [14:17:53] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1133.eqiad.wmnet with reason: Maintenance [14:17:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:15] 10SRE, 10SRE-OnFire (FY2021/2022-Q4), 10Infrastructure-Foundations (FY2021/2022-Q4), 10SRE Observability (FY2021/2022-Q4): Implement an accurate and easy to understand status page for all wikis - https://phabricator.wikimedia.org/T202061 (10TheDJ) While I'm aware that the Atlassian privacy policy applies..... [14:20:57] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for eigyan - https://phabricator.wikimedia.org/T305948 (10eigyan) [14:25:09] 10SRE, 10ops-eqiad: mw1308 - internal IPMI error - mgmt / DRAC problem - https://phabricator.wikimedia.org/T305741 (10Cmjohnson) @Dzahn this most likely will need to be powered off for 30 secs. Can I do this anytime? I want to do today if possible. 
[14:25:17] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for eigyan - https://phabricator.wikimedia.org/T305948 (10eigyan) @mepps Greetings Maggie, tagging you for later reference once you return from sick leave :) [14:29:34] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Essex Igyan eigyan - https://phabricator.wikimedia.org/T305948 (10eigyan) [14:29:34] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host cloudstore1010.mgmt.eqiad.wmnet with reboot policy FORCED [14:29:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:16] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host cloudstore1011.mgmt.eqiad.wmnet with reboot policy FORCED [14:30:17] (03CR) 10Vgutierrez: [V: 03+1 C: 03+1] sslcert: migrate update-ocsp-all cron to systemd timer job [puppet] - 10https://gerrit.wikimedia.org/r/778492 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [14:30:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:47] (03CR) 10Vgutierrez: [C: 03+1] service::catalog: add Spicerack comments [puppet] - 10https://gerrit.wikimedia.org/r/778332 (owner: 10Volans) [14:32:54] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [14:33:01] 10SRE, 10Wikimedia-Mailing-lists: Figure out if we can remove legacy domain support for mailing lists - https://phabricator.wikimedia.org/T280472 (10jhathaway) 05Open→03Resolved Emails are now being rejected, gmail present rejections like this: {F35048198} [14:36:37] 10SRE, 10SRE-OnFire (FY2021/2022-Q4), 10Infrastructure-Foundations (FY2021/2022-Q4), 10SRE Observability (FY2021/2022-Q4): Implement an accurate and easy to understand status page for 
all wikis - https://phabricator.wikimedia.org/T202061 (10TheDJ) Also feature request for them, as we now have a contract: R... [14:41:42] (03PS3) 10Vivian Rook: add chunkeddriver.py.patch to wallaby [puppet] - 10https://gerrit.wikimedia.org/r/777873 (https://phabricator.wikimedia.org/T304694) [14:43:05] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Essex Igyan eigyan - https://phabricator.wikimedia.org/T305948 (10jcrespo) Hi, I am Jaime from the SRE team, trying to process your request, as I am this week on clinic duty. :-) Apologies, but I am unsure of who is the actual... [14:43:49] jouncebot nowandnext [14:43:49] No deployments scheduled for the next 1 hour(s) and 16 minute(s) [14:43:49] In 1 hour(s) and 16 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220412T1600) [14:45:30] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudstore1010.mgmt.eqiad.wmnet with reboot policy FORCED [14:45:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:34] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Essex Igyan eigyan - https://phabricator.wikimedia.org/T305948 (10jcrespo) Ah, our messages crossed, now I do know who you are, after the title edit :-) May I still ask you about the email and the employee/volunteer question? 
[14:46:42] !log hnowlan@deploy1002 Started deploy [restbase/deploy@31675fb]: add guw.wikipedia.org [14:46:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:05] !log hnowlan@deploy1002 Finished deploy [restbase/deploy@31675fb]: add guw.wikipedia.org (duration: 00m 22s) [14:47:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:14] (03PS4) 10Vivian Rook: add chunkeddriver.py.patch to wallaby [puppet] - 10https://gerrit.wikimedia.org/r/777873 (https://phabricator.wikimedia.org/T304694) [14:47:26] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudstore1011.mgmt.eqiad.wmnet with reboot policy FORCED [14:47:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:05] (03CR) 10Ahmon Dancy: [C: 03+2] Temporarily undeprecate EditPage::$textbox2 [core] (wmf/1.39.0-wmf.6) - 10https://gerrit.wikimedia.org/r/778641 (https://phabricator.wikimedia.org/T305028) (owner: 10Thiemo Kreuz (WMDE)) [14:49:05] !log hnowlan@deploy1002 Started deploy [restbase/deploy@627f7d7]: add guw.wikipedia.org [14:49:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:18] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic: cp5002 memory errors on DIMM A4 - https://phabricator.wikimedia.org/T305423 (10ssingh) >>! In T305423#7841765, @wiki_willy wrote: > Hi @ssingh - since this server is out of warranty and due to be refreshed in a few quarters, do you still want us to purchase a replac... 
[14:55:18] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (10Cmjohnson) [14:57:21] (03CR) 10Ahmon Dancy: [C: 03+1] Add all members of the ops group to the deployment group [puppet] - 10https://gerrit.wikimedia.org/r/779047 (https://phabricator.wikimedia.org/T305729) (owner: 10JMeybohm) [14:58:15] (03CR) 10Ahmon Dancy: [C: 03+1] Switch default group for Kubernetes credentials files to deployment [puppet] - 10https://gerrit.wikimedia.org/r/779048 (https://phabricator.wikimedia.org/T305729) (owner: 10JMeybohm) [14:59:18] (03PS1) 10Razzi: clouddb: depool clouddb1013-1016 for upgrades [puppet] - 10https://gerrit.wikimedia.org/r/779483 (https://phabricator.wikimedia.org/T299480) [15:01:57] (03PS1) 10Cmjohnson: Adding new hosts cloudstore101[01] to site.pp and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/779484 (https://phabricator.wikimedia.org/T302981) [15:02:14] (03PS7) 10Volans: service::catalog: add Spicerack comments [puppet] - 10https://gerrit.wikimedia.org/r/778332 [15:02:25] (03PS1) 10Volans: alertmanager: fix and improve donwtime [software/spicerack] - 10https://gerrit.wikimedia.org/r/779485 (https://phabricator.wikimedia.org/T293209) [15:03:30] (03CR) 10Razzi: [C: 03+2] clouddb: depool clouddb1013-1016 for upgrades [puppet] - 10https://gerrit.wikimedia.org/r/779483 (https://phabricator.wikimedia.org/T299480) (owner: 10Razzi) [15:03:46] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance [15:03:48] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance [15:03:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:00] (03PS3) 10Cmjohnson: Adding new elastic 
servers to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/778329 (https://phabricator.wikimedia.org/T299609) [15:04:19] (03CR) 10Cmjohnson: [C: 03+2] Adding new hosts cloudstore101[01] to site.pp and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/779484 (https://phabricator.wikimedia.org/T302981) (owner: 10Cmjohnson) [15:04:41] (03PS4) 10Cmjohnson: Adding new elastic servers to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/778329 (https://phabricator.wikimedia.org/T299609) [15:04:50] (03Merged) 10jenkins-bot: Temporarily undeprecate EditPage::$textbox2 [core] (wmf/1.39.0-wmf.6) - 10https://gerrit.wikimedia.org/r/778641 (https://phabricator.wikimedia.org/T305028) (owner: 10Thiemo Kreuz (WMDE)) [15:04:58] (03CR) 10Volans: [C: 03+2] service::catalog: add Spicerack comments [puppet] - 10https://gerrit.wikimedia.org/r/778332 (owner: 10Volans) [15:05:01] !log hnowlan@deploy1002 Finished deploy [restbase/deploy@627f7d7]: add guw.wikipedia.org (duration: 15m 56s) [15:05:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:37] (03CR) 10Cmjohnson: [C: 03+2] Adding new elastic servers to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/778329 (https://phabricator.wikimedia.org/T299609) (owner: 10Cmjohnson) [15:06:42] !log dancy@deploy1002 Synchronized php-1.39.0-wmf.6/includes/EditPage.php: Backport: [[gerrit:778641|Temporarily undeprecate EditPage::$textbox2 (T305028)]] (duration: 00m 52s) [15:06:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:46] T305028: PHP Deprecated: Use of EditPage::$textbox2 was deprecated in MediaWiki 1.38. [Called from TwoColConflict\Hooks\TwoColConflictHooks::onEditPageBeforeConflictDiff] - https://phabricator.wikimedia.org/T305028 [15:06:58] (03PS5) 10Vivian Rook: add chunkeddriver.py.patch to wallaby [puppet] - 10https://gerrit.wikimedia.org/r/777873 (https://phabricator.wikimedia.org/T304694) [15:07:05] (03CR) 10Andrew Bogott: [C: 03+1] "Yep! 
We still have the nfs::primary hosts for a bit longer but the ::secondary hosts are out of service." [puppet] - 10https://gerrit.wikimedia.org/r/779446 (https://phabricator.wikimedia.org/T291405) (owner: 10David Caro) [15:07:39] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [15:07:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:42] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [15:07:43] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [15:07:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:47] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [15:07:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:01] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [15:08:32] !log razzi@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on clouddb1013.eqiad.wmnet with reason: Upgrade clouddb1013 to bullseye [15:08:34] !log razzi@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1013.eqiad.wmnet with reason: Upgrade clouddb1013 to bullseye [15:08:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:41] !log vgutierrez@puppetmaster1001 conftool action : set/pooled=inactive; selector: name=cp5002.eqsin.wmnet [15:09:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:51] !log razzi@cumin1001 START - Cookbook sre.hosts.reimage for host clouddb1013.eqiad.wmnet with OS bullseye [15:10:53] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:33] PROBLEM - haproxy failover on dbproxy1018 is CRITICAL: CRITICAL check_failover servers up 14 down 2: https://wikitech.wikimedia.org/wiki/HAProxy [15:17:13] PROBLEM - BFD status on lsw1-f1-eqiad.mgmt is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:17:29] PROBLEM - BFD status on lsw1-f3-eqiad.mgmt is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:17:53] PROBLEM - BFD status on lsw1-e3-eqiad.mgmt is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:18:11] (03CR) 10David Caro: [C: 03+2] wmcs: Remove unused role wmcs::nfs::secondary [puppet] - 10https://gerrit.wikimedia.org/r/779446 (https://phabricator.wikimedia.org/T291405) (owner: 10David Caro) [15:18:31] PROBLEM - BFD status on lsw1-e1-eqiad.mgmt is CRITICAL: CRIT: Down: 5 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:19:05] PROBLEM - BFD status on lsw1-f2-eqiad.mgmt is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:19:07] PROBLEM - BFD status on lsw1-e2-eqiad.mgmt is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:19:19] PROBLEM - BGP status on lsw1-f3-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64810/IPv4: Active - evpn_switches_eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:19:39] PROBLEM - BGP status on lsw1-e2-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64810/IPv4: Active - evpn_switches_eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:19:51] PROBLEM - BGP status on lsw1-e1-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64810/IPv4: Active - evpn_switches_eqiad, AS64810/IPv4: Active - evpn_switches_eqiad, AS64810/IPv4: Active - evpn_switches_eqiad, AS64810/IPv4: Active - evpn_switches_eqiad, AS64810/IPv4: Active - 
evpn_switches_eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:20:03] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host cloudstore1010.wikimedia.org with OS bullseye [15:20:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:07] PROBLEM - BGP status on lsw1-e3-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64810/IPv4: Active - evpn_switches_eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:20:13] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 3 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudstore1010.wikimedia.or... [15:20:53] PROBLEM - BGP status on lsw1-f2-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64810/IPv4: Active - evpn_switches_eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:21:19] PROBLEM - BGP status on lsw1-f1-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64810/IPv4: Active - evpn_switches_eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:21:50] (03CR) 10Vivian Rook: "https://puppet-compiler.wmflabs.org/pcc-worker1003/34807/" [puppet] - 10https://gerrit.wikimedia.org/r/777873 (https://phabricator.wikimedia.org/T304694) (owner: 10Vivian Rook) [15:22:04] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 3 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (10Cmjohnson) [15:22:59] !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudstore1010.wikimedia.org with OS bullseye [15:23:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:08] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 3 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - 
https://phabricator.wikimedia.org/T302981 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudstore1010.wikimedia.org wi... [15:23:44] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host cloudstore1010.wikimedia.org with OS bullseye [15:23:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:52] RECOVERY - BGP status on lsw1-f1-eqiad.mgmt is OK: BGP OK - up: 7, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:23:55] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 3 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudstore1010.wikimedia.or... [15:24:22] RECOVERY - BFD status on lsw1-e1-eqiad.mgmt is OK: OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:24:26] RECOVERY - BGP status on lsw1-f2-eqiad.mgmt is OK: BGP OK - up: 2, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:24:37] (03PS2) 10Volans: alertmanager: fix and improve donwtime [software/spicerack] - 10https://gerrit.wikimedia.org/r/779485 (https://phabricator.wikimedia.org/T293209) [15:25:38] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] aptrepo: introduce bullseye-wikimedia/component/prometheus-openstack-exporter [puppet] - 10https://gerrit.wikimedia.org/r/778488 (https://phabricator.wikimedia.org/T302178) (owner: 10Arturo Borrero Gonzalez) [15:26:06] (03PS1) 10Razzi: netboot: add clouddb partitioned as database [puppet] - 10https://gerrit.wikimedia.org/r/779488 (https://phabricator.wikimedia.org/T299480) [15:26:16] RECOVERY - BGP status on lsw1-e2-eqiad.mgmt is OK: BGP OK - up: 2, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:26:45] (03CR) 
10Andrew Bogott: [C: 03+1] add chunkeddriver.py.patch to wallaby (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/777873 (https://phabricator.wikimedia.org/T304694) (owner: 10Vivian Rook) [15:27:20] (03PS2) 10Razzi: netboot: add clouddb partitioned as database [puppet] - 10https://gerrit.wikimedia.org/r/779488 (https://phabricator.wikimedia.org/T299480) [15:27:54] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host cloudstore1011.wikimedia.org with OS bullseye [15:27:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:03] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 3 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudstore1011.wikimedia.or... [15:28:13] (03PS3) 10Majavah: P:wmcs::prometheus: use a single entry for openstack-exporter [puppet] - 10https://gerrit.wikimedia.org/r/768747 (https://phabricator.wikimedia.org/T302178) [15:28:34] RECOVERY - BFD status on lsw1-e3-eqiad.mgmt is OK: OK: UP: 2 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:29:19] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34808/console" [puppet] - 10https://gerrit.wikimedia.org/r/768747 (https://phabricator.wikimedia.org/T302178) (owner: 10Majavah) [15:29:46] (03PS3) 10Razzi: netboot: add clouddb partitioned as database [puppet] - 10https://gerrit.wikimedia.org/r/779488 (https://phabricator.wikimedia.org/T299480) [15:30:46] RECOVERY - BGP status on lsw1-e3-eqiad.mgmt is OK: BGP OK - up: 2, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:32:47] 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: Q3:(Need By: TBD) rack/setup/install dumpsdata100[67] - 
https://phabricator.wikimedia.org/T299443 (10Jclark-ctr) @RobH Both host dac cables have been corrected moved dumpsdata1007's dac cable back to port 1 moved dumpsdata1006's dac cable back to port 1 [15:32:49] (03CR) 10Arturo Borrero Gonzalez: P:wmcs::prometheus: use a single entry for openstack-exporter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/768747 (https://phabricator.wikimedia.org/T302178) (owner: 10Majavah) [15:34:34] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase2019.codfw.wmnet [15:34:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:56] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase2017.codfw.wmnet [15:35:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:12] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase2020.codfw.wmnet [15:36:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:23] !log razzi@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host clouddb1013.eqiad.wmnet with OS bullseye [15:36:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:52] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase2026.codfw.wmnet [15:36:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:55] (03PS4) 10Majavah: P:wmcs::prometheus: use a single entry for openstack-exporter [puppet] - 10https://gerrit.wikimedia.org/r/768747 (https://phabricator.wikimedia.org/T302178) [15:37:30] RECOVERY - BFD status on lsw1-f2-eqiad.mgmt is OK: OK: UP: 2 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:37:38] RECOVERY - BGP status on lsw1-e1-eqiad.mgmt is OK: BGP OK - up: 7, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:37:43] (03CR) 
10Majavah: P:wmcs::prometheus: use a single entry for openstack-exporter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/768747 (https://phabricator.wikimedia.org/T302178) (owner: 10Majavah) [15:37:56] RECOVERY - BGP status on lsw1-f3-eqiad.mgmt is OK: BGP OK - up: 2, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:38:02] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [15:38:04] RECOVERY - BFD status on lsw1-f3-eqiad.mgmt is OK: OK: UP: 2 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:41:37] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] P:wmcs::prometheus: use a single entry for openstack-exporter [puppet] - 10https://gerrit.wikimedia.org/r/768747 (https://phabricator.wikimedia.org/T302178) (owner: 10Majavah) [15:42:30] 10SRE, 10ops-codfw, 10DC-Ops, 10RESTBase: Q3:(Need By: TBD) rack/setup/install restbase2027 - https://phabricator.wikimedia.org/T301399 (10hnowlan) For new hosts, it seems the `reuse` profile won't work as it expects an existing array. The non-reuse `cassandrahosts-3ssd-jbod` config is required. [15:43:12] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. 
https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [15:44:59] !log removed a bunch of old src & binary packages for prometheus-openstack-exporter (T302178) [15:45:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:04] T302178: prometheus-openstack-exporter No module named 'urlparse' - https://phabricator.wikimedia.org/T302178 [15:45:23] (03PS1) 10Hnowlan: install_server: use non-reuse partition for new host [puppet] - 10https://gerrit.wikimedia.org/r/779494 (https://phabricator.wikimedia.org/T301399) [15:46:30] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudstore1010.wikimedia.org with OS bullseye [15:46:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:36] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudstore1010.wikimedia.org wi... [15:47:32] RECOVERY - BFD status on lsw1-f1-eqiad.mgmt is OK: OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:47:46] RECOVERY - BFD status on lsw1-e2-eqiad.mgmt is OK: OK: UP: 2 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:47:56] 10SRE, 10SRE-Access-Requests, 10Infrastructure-Foundations, 10serviceops, 10Release-Engineering-Team (Radar): Need a service account on deploy servers - https://phabricator.wikimedia.org/T303857 (10dancy) [15:48:27] 10SRE, 10ops-codfw, 10Data-Persistence (Consultation): codfw: Dedicate Rack B1 for cloudX-dev servers - https://phabricator.wikimedia.org/T305469 (10hnowlan) >>! In T305469#7845959, @Papaul wrote: > @hnowlan will it be possible to get me restbase2021 offline on April 14th at 9:30am CT? > > thanks. Yep, t... 
[15:49:47] !log aborrero@apt1001:~ $ sudo -i reprepro -C component/prometheus-openstack-exporter includedeb bullseye-wikimedia ${PWD}/prometheus-openstack-exporter_1.5.0-1_amd64.deb (T302178) [15:49:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:32] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1106.eqiad.wmnet with reason: Maintenance [15:51:33] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1106.eqiad.wmnet with reason: Maintenance [15:51:35] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [15:51:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:39] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [15:51:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1106 (T298565)', diff saved to https://phabricator.wikimedia.org/P24512 and previous config saved to /var/cache/conftool/dbconfig/20220412-155143-ladsgroup.json [15:51:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:47] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [15:53:34] (03PS1) 10Jsn.sherman: Update enwiki surveys on beta for QA [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/779499 (https://phabricator.wikimedia.org/T294363) [15:54:23] 10SRE, 10SRE-Access-Requests, 10Infrastructure-Foundations, 10serviceops, 10Release-Engineering-Team (Radar): Need a service account on deploy servers - https://phabricator.wikimedia.org/T303857 (10dancy) [15:56:18] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [15:56:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:36] 10SRE, 10SRE-Access-Requests, 10Infrastructure-Foundations, 10serviceops, 10Release-Engineering-Team (Radar): Need a service account on deploy servers - https://phabricator.wikimedia.org/T303857 (10dancy) >>! In T303857#7837967, @Joe wrote: >>>! In T303857#7818920, @dancy wrote: >> I have confirmed that... [16:00:05] jbond and rzl: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220412T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. 
[16:00:23] 10SRE, 10SRE-Access-Requests, 10Infrastructure-Foundations, 10serviceops, 10Release-Engineering-Team (Radar): Need a service account on deploy servers - https://phabricator.wikimedia.org/T303857 (10dancy) Pinging @akosiaris [16:01:03] (03CR) 10Volans: [C: 04-1] "There is a conflict, details inline" [puppet] - 10https://gerrit.wikimedia.org/r/779488 (https://phabricator.wikimedia.org/T299480) (owner: 10Razzi) [16:01:31] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:01:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:14] (03CR) 10CDanis: [C: 03+1] alertmanager: fix and improve donwtime [software/spicerack] - 10https://gerrit.wikimedia.org/r/779485 (https://phabricator.wikimedia.org/T293209) (owner: 10Volans) [16:03:24] 10SRE, 10SRE-Access-Requests, 10Infrastructure-Foundations, 10serviceops, 10Release-Engineering-Team (Radar): Need a service account on deploy servers - https://phabricator.wikimedia.org/T303857 (10akosiaris) 05Resolved→03Open Thanks for the ping, wouldn't have seen it otherwise. Re-opening and I 'll... [16:03:35] 10SRE, 10SRE-Access-Requests, 10Infrastructure-Foundations, 10serviceops, 10Release-Engineering-Team (Radar): Need a service account on deploy servers - https://phabricator.wikimedia.org/T303857 (10jcrespo) Hey, @dancy - I may not be able to help you directly but I may be able to find someone who can. Ho... 
[16:03:40] (03CR) 10Papaul: [V: 03+1] install_server: use non-reuse partition for new host [puppet] - 10https://gerrit.wikimedia.org/r/779494 (https://phabricator.wikimedia.org/T301399) (owner: 10Hnowlan) [16:03:57] (03CR) 10Hnowlan: [C: 03+2] install_server: use non-reuse partition for new host [puppet] - 10https://gerrit.wikimedia.org/r/779494 (https://phabricator.wikimedia.org/T301399) (owner: 10Hnowlan) [16:03:58] RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [16:04:30] 10SRE, 10SRE-Access-Requests, 10Infrastructure-Foundations, 10serviceops, 10Release-Engineering-Team (Radar): Need a service account on deploy servers - https://phabricator.wikimedia.org/T303857 (10dancy) >>! In T303857#7848600, @jcrespo wrote: > Hey, @dancy - I may not be able to help you directly but I... [16:05:11] 10SRE, 10ops-codfw, 10DC-Ops, 10RESTBase, 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install restbase2027 - https://phabricator.wikimedia.org/T301399 (10hnowlan) a:05hnowlan→03Papaul [16:06:48] 10SRE, 10ops-codfw, 10DC-Ops, 10RESTBase, 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install restbase2027 - https://phabricator.wikimedia.org/T301399 (10Papaul) @hnowlan " use non-reuse partition for new host" so if you want to re-image this host later down the road you will have to change it agai... 
[16:07:55] (03PS4) 10Razzi: netboot: add clouddb partitioned as database [puppet] - 10https://gerrit.wikimedia.org/r/779488 (https://phabricator.wikimedia.org/T299480) [16:08:19] !log dzahn@cumin2002 conftool action : set/pooled=no; selector: dc=eqiad,name=mw1308.eqiad.wmnet [16:08:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:52] (03PS5) 10Razzi: netboot: add clouddb1013 partitioned as database [puppet] - 10https://gerrit.wikimedia.org/r/779488 (https://phabricator.wikimedia.org/T299480) [16:08:55] 10SRE, 10ops-eqiad: mw1308 - internal IPMI error - mgmt / DRAC problem - https://phabricator.wikimedia.org/T305741 (10Dzahn) @Cmjohnson I just depooled that server. You can do it now anytime. Hope it wasn't too late. [16:09:47] 10SRE, 10ops-codfw, 10DC-Ops, 10RESTBase, 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install restbase2027 - https://phabricator.wikimedia.org/T301399 (10hnowlan) >>! In T301399#7848614, @Papaul wrote: > @hnowlan > " use non-reuse partition for new host" > so if you want to re-image this host later... [16:10:58] 10SRE, 10Data-Engineering, 10LDAP-Access-Requests: Request to add user gmodena to analytics-research-admins group - https://phabricator.wikimedia.org/T305880 (10jcrespo) @gmodena Did the access work? [16:12:24] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:12:41] 10SRE, 10SRE-Access-Requests, 10Data-Engineering: Request to add user gmodena to analytics-research-admins group - https://phabricator.wikimedia.org/T305880 (10Zabe) [16:14:12] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for vyuen - https://phabricator.wikimedia.org/T305934 (10jcrespo) a:03jcrespo Hi, I will add you to the wmf ldap group, as requested. 
[16:14:21] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for vyuen - https://phabricator.wikimedia.org/T305934 (10jcrespo) p:05Triage→03High [16:15:54] (03PS6) 10Razzi: netboot: add clouddb1013 partitioned as database [puppet] - 10https://gerrit.wikimedia.org/r/779488 (https://phabricator.wikimedia.org/T299480) [16:16:57] 10SRE, 10Infrastructure-Foundations, 10Mail: Exim emitting warnings about tainted filenames - https://phabricator.wikimedia.org/T305962 (10jhathaway) [16:21:54] (03PS1) 10Jcrespo: admin: Add vyuen to the list of privileged ldap users [puppet] - 10https://gerrit.wikimedia.org/r/779503 (https://phabricator.wikimedia.org/T305934) [16:22:20] (03PS1) 10JHathaway: mx: use $domain_data rather than $domain for aliases [puppet] - 10https://gerrit.wikimedia.org/r/779504 (https://phabricator.wikimedia.org/T305962) [16:23:14] (03CR) 10Volans: [C: 03+1] "I don't have the context to be sure that the recipe is the correct one, but the patch can't create issues to other hosts, so ok for me to " [puppet] - 10https://gerrit.wikimedia.org/r/779488 (https://phabricator.wikimedia.org/T299480) (owner: 10Razzi) [16:23:25] (03CR) 10JHathaway: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34809/console" [puppet] - 10https://gerrit.wikimedia.org/r/779504 (https://phabricator.wikimedia.org/T305962) (owner: 10JHathaway) [16:23:51] (03PS1) 10Majavah: hieradata: pcc: update puppetmaster for clouddb-services [puppet] - 10https://gerrit.wikimedia.org/r/779505 [16:24:06] (03CR) 10Volans: [C: 03+2] alertmanager: fix and improve donwtime [software/spicerack] - 10https://gerrit.wikimedia.org/r/779485 (https://phabricator.wikimedia.org/T293209) (owner: 10Volans) [16:25:14] (03CR) 10JHathaway: "pcc output, https://puppet-compiler.wmflabs.org/pcc-worker1003/34809/" [puppet] - 10https://gerrit.wikimedia.org/r/779504 (https://phabricator.wikimedia.org/T305962) (owner: 10JHathaway) [16:25:19] (03CR) 10Arturo Borrero 
Gonzalez: [C: 03+2] hieradata: pcc: update puppetmaster for clouddb-services [puppet] - 10https://gerrit.wikimedia.org/r/779505 (owner: 10Majavah) [16:28:12] (03CR) 10Jcrespo: [C: 03+2] admin: Add vyuen to the list of privileged ldap users [puppet] - 10https://gerrit.wikimedia.org/r/779503 (https://phabricator.wikimedia.org/T305934) (owner: 10Jcrespo) [16:30:16] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host restbase2027.codfw.wmnet with OS buster [16:30:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:24] 10SRE, 10ops-codfw, 10DC-Ops, 10RESTBase, 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install restbase2027 - https://phabricator.wikimedia.org/T301399 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host restbase2027.codfw.wmnet with OS buster [16:32:10] (03Merged) 10jenkins-bot: alertmanager: fix and improve donwtime [software/spicerack] - 10https://gerrit.wikimedia.org/r/779485 (https://phabricator.wikimedia.org/T293209) (owner: 10Volans) [16:32:18] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for vyuen - https://phabricator.wikimedia.org/T305934 (10jcrespo) The change has been applied: https://ldap.toolforge.org/user/vyuen You should have now (of in a few minutes) access to superset. Please check this is correct and let us know it worked for you.... [16:33:20] (03CR) 10Razzi: [C: 03+2] netboot: add clouddb1013 partitioned as database [puppet] - 10https://gerrit.wikimedia.org/r/779488 (https://phabricator.wikimedia.org/T299480) (owner: 10Razzi) [16:33:57] !log gitlab: pausing runner-1013, then will remove it and create new bullseye runner to replace it [16:33:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:38] (03CR) 10Andrew Bogott: "This is definitely what I'm after! 
A few questions:" [puppet] - 10https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [16:42:56] !log razzi@cumin1001 START - Cookbook sre.hosts.reimage for host clouddb1013.eqiad.wmnet with OS bullseye [16:42:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:33] 10SRE, 10SRE-Access-Requests, 10SRE-OnFire, 10WMF-Legal: Grant Zabe access to the T302047 gdoc incident report - https://phabricator.wikimedia.org/T302163 (10KFrancis) @jcrespo I am confirming Grant Zabe was approved for access. Thanks! [16:49:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T298565)', diff saved to https://phabricator.wikimedia.org/P24513 and previous config saved to /var/cache/conftool/dbconfig/20220412-164907-ladsgroup.json [16:49:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:12] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [16:49:19] (03PS1) 10Cathal Mooney: Modify CR loopback filter and add VRF-specific filter for switches [homer/public] - 10https://gerrit.wikimedia.org/r/779510 (https://phabricator.wikimedia.org/T304553) [16:49:22] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on restbase2027.codfw.wmnet with reason: host reimage [16:49:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:45] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for vyuen - https://phabricator.wikimedia.org/T305934 (10vyuen) Thank you! 
[16:50:50] 10SRE, 10SRE-Access-Requests, 10SRE-OnFire, 10WMF-Legal: Grant Zabe access to the T302047 gdoc incident report - https://phabricator.wikimedia.org/T302163 (10Dzahn) added Zabe to the doc (I had sharing privs because I originally created it). I used the exact email address found on the "NDA" google doc. [16:52:14] 10SRE, 10SRE-Access-Requests, 10SRE-OnFire, 10WMF-Legal: Grant Zabe access to the T302047 gdoc incident report - https://phabricator.wikimedia.org/T302163 (10Dzahn) 05Open→03Resolved a:05KFrancis→03None @Zabe You should have email and this should be resolved. [16:53:49] !log razzi@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on clouddb1013.eqiad.wmnet with reason: host reimage [16:53:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:21] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on restbase2027.codfw.wmnet with reason: host reimage [16:54:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:56:53] 10SRE, 10SRE-Access-Requests, 10SRE-OnFire, 10WMF-Legal: Grant Zabe access to the T302047 gdoc incident report - https://phabricator.wikimedia.org/T302163 (10Zabe) Yes, thanks! [16:57:06] !log razzi@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on clouddb1013.eqiad.wmnet with reason: host reimage [16:57:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:01] (03CR) 10Hnowlan: [C: 03+1] profile: issue warnings for check_mw_versions [puppet] - 10https://gerrit.wikimedia.org/r/767729 (https://phabricator.wikimedia.org/T302832) (owner: 10Filippo Giunchedi) [17:02:38] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for vyuen - https://phabricator.wikimedia.org/T305934 (10jcrespo) 05Open→03Resolved Assuming it worked as intended- please reopen if not. Or you can create a separate task for additional access requests. 
[17:03:59] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase2027.codfw.wmnet with OS buster [17:04:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:04:03] 10SRE, 10ops-codfw, 10DC-Ops, 10RESTBase: Q3:(Need By: TBD) rack/setup/install restbase2027 - https://phabricator.wikimedia.org/T301399 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host restbase2027.codfw.wmnet with OS buster completed: - restbase2027 (**PASS*... [17:04:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P24514 and previous config saved to /var/cache/conftool/dbconfig/20220412-170412-ladsgroup.json [17:04:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:04:20] (03CR) 10Cathal Mooney: [C: 03+2] Modify CR loopback filter and add VRF-specific filter for switches [homer/public] - 10https://gerrit.wikimedia.org/r/779510 (https://phabricator.wikimedia.org/T304553) (owner: 10Cathal Mooney) [17:05:09] (03Merged) 10jenkins-bot: Modify CR loopback filter and add VRF-specific filter for switches [homer/public] - 10https://gerrit.wikimedia.org/r/779510 (https://phabricator.wikimedia.org/T304553) (owner: 10Cathal Mooney) [17:05:36] (03PS1) 10Arturo Borrero Gonzalez: prometheus-openstack-exporter: introduce some caching logic in the wrapper [puppet] - 10https://gerrit.wikimedia.org/r/779515 (https://phabricator.wikimedia.org/T302178) [17:05:38] (03PS1) 10Arturo Borrero Gonzalez: prometheus-openstack-exporter: only run it on the primary server [puppet] - 10https://gerrit.wikimedia.org/r/779516 (https://phabricator.wikimedia.org/T302178) [17:10:38] 10SRE, 10Phabricator: Switch phabricator from using apache to nginx - https://phabricator.wikimedia.org/T185644 (10Dzahn) Given the underlying issue has been resolved and we see no performance issues with Phabricator that I'm 
aware of I would tend to NOT do this and keep using Apache just like we do for tons o... [17:10:58] 10SRE, 10Phabricator, 10serviceops-radar: Switch phabricator from using apache to nginx - https://phabricator.wikimedia.org/T185644 (10Dzahn) [17:12:02] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Unify loopback filters between CR routers and L3 switches - https://phabricator.wikimedia.org/T304553 (10cmooney) On the EVPN devices filtering needs to be defined on each 'unit' of the loopback interface, i.e. the default one "lo0.0" in th... [17:12:51] 10SRE, 10Wikimedia-SVG-rendering, 10serviceops-radar: Adding new font for CJK media display - https://phabricator.wikimedia.org/T280432 (10Dzahn) [17:13:25] (03PS2) 10Arturo Borrero Gonzalez: prometheus-openstack-exporter: introduce some caching logic in the wrapper [puppet] - 10https://gerrit.wikimedia.org/r/779515 (https://phabricator.wikimedia.org/T302178) [17:13:27] (03PS2) 10Arturo Borrero Gonzalez: prometheus-openstack-exporter: only run it on the primary server [puppet] - 10https://gerrit.wikimedia.org/r/779516 (https://phabricator.wikimedia.org/T302178) [17:16:19] RECOVERY - haproxy failover on dbproxy1018 is OK: OK check_failover servers up 16 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [17:16:56] 10SRE, 10Phabricator, 10serviceops-radar: Switch phabricator from using apache to nginx - https://phabricator.wikimedia.org/T185644 (10jcrespo) +1 to set it as resolved. I believe there could be been some bugs, but also performance issues in the past were due to the old search architecture + some misbehaving... 
[17:19:04] (03CR) 10Majavah: [C: 04-1] "nit inline" [puppet] - 10https://gerrit.wikimedia.org/r/779515 (https://phabricator.wikimedia.org/T302178) (owner: 10Arturo Borrero Gonzalez) [17:19:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P24515 and previous config saved to /var/cache/conftool/dbconfig/20220412-171917-ladsgroup.json [17:19:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:11] (03PS3) 10Arturo Borrero Gonzalez: prometheus-openstack-exporter: introduce some caching logic in the wrapper [puppet] - 10https://gerrit.wikimedia.org/r/779515 (https://phabricator.wikimedia.org/T302178) [17:23:13] (03PS3) 10Arturo Borrero Gonzalez: prometheus-openstack-exporter: only run it on the primary server [puppet] - 10https://gerrit.wikimedia.org/r/779516 (https://phabricator.wikimedia.org/T302178) [17:23:35] (03CR) 10Majavah: [C: 04-1] "This currently will not remove the exporter service after changing the primary keystone server. 
Also, I wonder if it'd be better to run it" [puppet] - 10https://gerrit.wikimedia.org/r/779516 (https://phabricator.wikimedia.org/T302178) (owner: 10Arturo Borrero Gonzalez) [17:23:46] (03CR) 10Arturo Borrero Gonzalez: prometheus-openstack-exporter: introduce some caching logic in the wrapper (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/779515 (https://phabricator.wikimedia.org/T302178) (owner: 10Arturo Borrero Gonzalez) [17:34:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T298565)', diff saved to https://phabricator.wikimedia.org/P24516 and previous config saved to /var/cache/conftool/dbconfig/20220412-173422-ladsgroup.json [17:34:24] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1169.eqiad.wmnet with reason: Maintenance [17:34:25] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1169.eqiad.wmnet with reason: Maintenance [17:34:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:27] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [17:34:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1169 (T298565)', diff saved to https://phabricator.wikimedia.org/P24517 and previous config saved to /var/cache/conftool/dbconfig/20220412-173430-ladsgroup.json [17:34:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:42:00] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Essex Igyan eigyan - 
https://phabricator.wikimedia.org/T305948 (10jcrespo) [17:43:43] 10SRE, 10ops-codfw, 10DC-Ops, 10RESTBase: Q3:(Need By: TBD) rack/setup/install restbase2027 - https://phabricator.wikimedia.org/T301399 (10Papaul) [17:44:30] 10SRE, 10ops-codfw, 10DC-Ops, 10RESTBase: Q3:(Need By: TBD) rack/setup/install restbase2027 - https://phabricator.wikimedia.org/T301399 (10Papaul) 05Open→03Resolved @hnowlan complete [17:49:39] jouncebot now [17:49:39] No deployments scheduled for the next 0 hour(s) and 10 minute(s) [17:56:20] (03CR) 10RLazarus: [C: 03+1] "Good idea, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/779444 (https://phabricator.wikimedia.org/T305581) (owner: 10Volans) [17:57:32] (03CR) 10RLazarus: [V: 03+1 C: 03+2] sretest: Uninstall external_clouds_vendors [puppet] - 10https://gerrit.wikimedia.org/r/779145 (https://phabricator.wikimedia.org/T270391) (owner: 10RLazarus) [17:57:40] rzl: how do you suggest to proceed with this one? ^^^should I merge and force a run to ensure both output and that it will trigger a change on all files? [17:58:15] (03PS1) 10Eigyan: [wmf-config] Undeploy safety survey from PT wiki - PRODUCTION [mediawiki-config] - 10https://gerrit.wikimedia.org/r/779526 [17:58:29] (03CR) 10RLazarus: [C: 03+2] external_clouds_vendors: Remove migration shim for T305581 [puppet] - 10https://gerrit.wikimedia.org/r/779149 (https://phabricator.wikimedia.org/T305581) (owner: 10RLazarus) [17:59:29] volans: oh sorry, I assumed the highlight in this window was from wikibugs :P yeah, that sounds reasonable [17:59:57] imo it's fine to either force a run or wait for midnight UTC -- but if you want to be able to supervise and follow up, forcing sounds good [18:00:04] dancy and jnuche: Time to snap out of that daydream and deploy MediaWiki train - Utc-7+Utc-0 Version. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220412T1800). 
[18:00:06] rzl: no prob :) thanks, will do [18:00:20] o/ [18:00:23] (03CR) 10Volans: [C: 03+2] cloud vendors: force yaml output format [puppet] - 10https://gerrit.wikimedia.org/r/779444 (https://phabricator.wikimedia.org/T305581) (owner: 10Volans) [18:00:37] (03PS2) 10Volans: cloud vendors: force yaml output format [puppet] - 10https://gerrit.wikimedia.org/r/779444 (https://phabricator.wikimedia.org/T305581) [18:00:49] OK for me to proceed with the train? [18:01:18] tracks are clear, afaik [18:01:27] Thx [18:01:57] (03CR) 10Eigyan: "@ This looks good as well hopefully indexing will agree this time." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/779499 (https://phabricator.wikimedia.org/T294363) (owner: 10Jsn.sherman) [18:02:34] hmm... that's beta config. alright [18:04:27] (03PS3) 10Zabe: Migrate $wmfUdp2logDest to $wmgUdp2logDest [mediawiki-config] - 10https://gerrit.wikimedia.org/r/776258 (https://phabricator.wikimedia.org/T45956) [18:08:24] (03PS2) 10Eigyan: [wmf-config] Undeploy safety survey from PT wiki - PRODUCTION [mediawiki-config] - 10https://gerrit.wikimedia.org/r/779526 (https://phabricator.wikimedia.org/T305855) [18:14:26] (03CR) 10Jsn.sherman: [C: 03+1] "Looks good to me!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/779526 (https://phabricator.wikimedia.org/T305855) (owner: 10Eigyan) [18:15:22] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [18:15:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:25] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [18:15:27] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [18:15:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:30] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [18:15:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:39] (03CR) 10Scardenasmolinar: [C: 03+1] "LGTM!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/779526 (https://phabricator.wikimedia.org/T305855) (owner: 10Eigyan) [18:24:29] (03CR) 10Herron: "Thanks for putting this together, I'm starting to understand better the approach that you described at the meeting. 
Please see comments i" [puppet] - 10https://gerrit.wikimedia.org/r/775375 (https://phabricator.wikimedia.org/T305090) (owner: 10Cwhite) [18:27:26] (03PS1) 10Dzahn: DHCP: make doh and durum hosts use the bullseye installer [puppet] - 10https://gerrit.wikimedia.org/r/779531 (https://phabricator.wikimedia.org/T305589) [18:27:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T298565)', diff saved to https://phabricator.wikimedia.org/P24520 and previous config saved to /var/cache/conftool/dbconfig/20220412-182747-ladsgroup.json [18:27:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:52] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [18:29:36] mutante: <3 [18:30:05] just today morning: [18:30:09] 3221 12/04/22 11:48:31 vim modules/install_server/files/dhcpd/linux-host-entries.ttyS0-115200 [18:30:12] :) [18:30:12] (03PS2) 10Dzahn: DHCP: make doh and durum hosts use the bullseye installer [puppet] - 10https://gerrit.wikimedia.org/r/779531 (https://phabricator.wikimedia.org/T305589) [18:31:57] sukhe: ;) aww, this always happens I guess [18:32:17] the whole point was I wanted to say "regardless if cookbook or not" :) [18:32:25] I didn't get far enough to make the patch as I got busy with other stuff, but yes, you read my mind for sure! 
[18:32:40] mutante: for now, we will be doing it manually [18:32:46] but yeah, regardless of the cookbook [18:32:54] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [18:33:02] sukhe: ok, sounds good to me [18:35:33] (03CR) 10Ssingh: [C: 03+1] "Thank you kindly for the patch! <3" [puppet] - 10https://gerrit.wikimedia.org/r/779531 (https://phabricator.wikimedia.org/T305589) (owner: 10Dzahn) [18:36:18] 10SRE, 10SRE-OnFire (FY2021/2022-Q4), 10Infrastructure-Foundations (FY2021/2022-Q4), 10SRE Observability (FY2021/2022-Q4): Implement an accurate and easy to understand status page for all wikis - https://phabricator.wikimedia.org/T202061 (10CDanis) [18:40:24] (03PS1) 10Ahmon Dancy: testwikis wikis to 1.39.0-wmf.7 refs T305213 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/779535 [18:40:26] (03CR) 10Ahmon Dancy: [C: 03+2] testwikis wikis to 1.39.0-wmf.7 refs T305213 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/779535 (owner: 10Ahmon Dancy) [18:41:08] (03Merged) 10jenkins-bot: testwikis wikis to 1.39.0-wmf.7 refs T305213 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/779535 (owner: 10Ahmon Dancy) [18:42:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P24521 and previous config saved to /var/cache/conftool/dbconfig/20220412-184252-ladsgroup.json [18:42:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:55] (03PS1) 10Daniel Kinzler: Update for daniel's bash environment [puppet] - 10https://gerrit.wikimedia.org/r/779536 [18:44:16] (03CR) 10jerkins-bot: [V: 04-1] Update for daniel's bash environment [puppet] - 10https://gerrit.wikimedia.org/r/779536 (owner: 10Daniel Kinzler) [18:44:24] (03PS2) 
10Daniel Kinzler: Update for daniel's bash environment [puppet] - 10https://gerrit.wikimedia.org/r/779536 [18:45:35] (03PS2) 10Bking: elasticsearch: upgrade codfw to elasticsearch 6.8 [puppet] - 10https://gerrit.wikimedia.org/r/763483 (https://phabricator.wikimedia.org/T301958) (owner: 10Gehel) [18:45:51] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [18:45:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:55] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [18:45:56] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [18:45:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:00] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [18:46:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:24] (03CR) 10Bking: [C: 03+2] elasticsearch: upgrade codfw to elasticsearch 6.8 [puppet] - 10https://gerrit.wikimedia.org/r/763483 (https://phabricator.wikimedia.org/T301958) (owner: 10Gehel) [18:46:35] (03CR) 10Jcrespo: "Let me know when ready so I can merge." [puppet] - 10https://gerrit.wikimedia.org/r/779536 (owner: 10Daniel Kinzler) [18:48:47] 10SRE, 10Security-API-Service, 10Security-Team, 10Performance-Team (Radar), and 2 others: Security API Storage Needs - https://phabricator.wikimedia.org/T301428 (10sbassett) [18:48:50] jynus: ready :) [18:49:10] did you test that it doesn't create an infinite loop :-D? 
[18:49:44] (03CR) 10Jcrespo: [C: 03+2] Update for daniel's bash environment [puppet] - 10https://gerrit.wikimedia.org/r/779536 (owner: 10Daniel Kinzler) [18:50:14] jynus: I tried to test it, but while i was doing it, puppet kept overriding my files :) [18:50:27] jynus: i tested it locally, using a dummy user on my machine [18:50:52] tell me your bastion so I can manually run it there first [18:51:02] uh [18:51:03] *bastion host [18:51:11] whatever you have on ssh configured [18:51:20] (03CR) 10RLazarus: external_cloud_vendors: Add a known-clients/Googlebot ipblock (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/779157 (https://phabricator.wikimedia.org/T305581) (owner: 10RLazarus) [18:51:23] it's ok to temp disable puppet on a test host to test something, as long as it doesn't stay disabled for days [18:51:35] mutante: yeah, but he will not have the rights [18:52:09] true! maybe we could allow it in sudo privs for certain groups [18:52:16] mutante: i don't know how to do that ;) [18:52:18] jynus: bast3005.wikimedia.org [18:52:21] or do it in a case by case basis.. if it's needed now [18:52:40] 2 ending spaces on the script, 8/10 :-) [18:53:46] pfft [18:53:58] how do i tell sublime to strip that? [18:54:25] also you'll probably want to set the files as executable, right now they aren't [18:54:27] don't worry, I am mostly teasing you while I wait for puppet to finish running [18:55:08] duesen: try logging to bast3005 now [18:55:33] taavi: they get sources, so doesn't matter, right? [18:55:48] jynus: "mostly" ;) [18:56:15] afaik at least .profile needs to be executable [18:56:37] (03PS1) 10Volans: CHANGELOG: add changelogs for release v2.4.1 [software/spicerack] - 10https://gerrit.wikimedia.org/r/779539 [18:57:01] (03CR) 10Volans: [C: 03+2] CHANGELOG: add changelogs for release v2.4.1 [software/spicerack] - 10https://gerrit.wikimedia.org/r/779539 (owner: 10Volans) [18:57:03] duesen: do you want me to disable puppet on 3005 for testing? 
[18:57:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P24522 and previous config saved to /var/cache/conftool/dbconfig/20220412-185757-ladsgroup.json [18:58:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:58:03] PROBLEM - Check systemd state on clouddb1013 is CRITICAL: CRITICAL - degraded: The following units failed: wmf-pt-kill@s1.service,wmf-pt-kill@s3.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:58:06] jynus: i just tested, and found a tiny bug (my prompt doesn't end with a space) [18:58:10] let me fix that real quick [18:59:23] !log dancy@deploy1002 Started scap: testwikis wikis to 1.39.0-wmf.7 refs T305213 [18:59:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:59:26] T305213: 1.39.0-wmf.7 deployment blockers - https://phabricator.wikimedia.org/T305213 [19:00:57] !log dancy@deploy1002 scap failed: TypeError cannot unpack non-iterable NoneType object (duration: 01m 34s) [19:00:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:01:17] ^^ That was me testing something out [19:02:49] (03PS1) 10Daniel Kinzler: Fix PS1 in daniel's bash environment [puppet] - 10https://gerrit.wikimedia.org/r/779540 [19:03:32] (03CR) 10Ebernhardson: [C: 03+1] wdqs: activate jvmquake at 300:5 [puppet] - 10https://gerrit.wikimedia.org/r/779440 (https://phabricator.wikimedia.org/T293862) (owner: 10DCausse) [19:04:23] (03PS2) 10Daniel Kinzler: Fix PS1 in daniel's bash environment [puppet] - 10https://gerrit.wikimedia.org/r/779540 [19:04:56] jynus: ok, should be good now [19:05:11] jynus: new NEW change, i mean [19:05:48] I will merge, but also give you a 7/10 on commit message for not adding a component-colon (admin) [19:05:55] !log dancy@deploy1002 Started scap: testwikis wikis to 1.39.0-wmf.7 refs T305213 [19:05:59] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:00] T305213: 1.39.0-wmf.7 deployment blockers - https://phabricator.wikimedia.org/T305213 [19:06:10] !log dancy@deploy1002 deploy-promote aborted: (duration: 01m 09s) [19:06:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:39] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v2.4.1 [software/spicerack] - 10https://gerrit.wikimedia.org/r/779539 (owner: 10Volans) [19:06:41] !log dancy@deploy1002 prep aborted: (duration: 00m 11s) [19:06:42] !log dancy@deploy1002 deploy-promote aborted: (duration: 00m 14s) [19:06:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:13] jynus: noted. i rarely contribute to this repo. I'll try to remember [19:07:16] !log dancy@deploy1002 Started scap: testwikis wikis to 1.39.0-wmf.7 refs T305213 [19:07:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:20] (03CR) 10Jcrespo: [C: 03+2] Fix PS1 in daniel's bash environment [puppet] - 10https://gerrit.wikimedia.org/r/779540 (owner: 10Daniel Kinzler) [19:07:47] long time we don't talk, btw [19:07:53] hope you are well [19:09:08] (03CR) 10Herron: mx: use $domain_data rather than $domain for aliases (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/779504 (https://phabricator.wikimedia.org/T305962) (owner: 10JHathaway) [19:09:08] jynus: true true! I hope traveling will be possible again soon. [19:09:49] puppet running... [19:09:54] I'm mostly well. Got some dental surgery that wasn't fun last week, still dealing with that... [19:09:59] jynus: how are you doing? [19:10:37] do you often miss the ability to disable puppet? I would actually create a patch/ticket for that kind of thing.. if you say it's worth it and then we have to talk about WHICH hosts.. 
rather not bastion but more like mwdebug* or people* [19:10:40] better tell you next time we meet :-D [19:11:26] (03PS1) 10Volans: Upstream release v2.4.1 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/779541 [19:11:48] (03CR) 10Volans: [C: 03+2] Upstream release v2.4.1 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/779541 (owner: 10Volans) [19:13:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T298565)', diff saved to https://phabricator.wikimedia.org/P24523 and previous config saved to /var/cache/conftool/dbconfig/20220412-191302-ladsgroup.json [19:13:04] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1164.eqiad.wmnet with reason: Maintenance [19:13:06] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1164.eqiad.wmnet with reason: Maintenance [19:13:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:08] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [19:13:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1164 (T298565)', diff saved to https://phabricator.wikimedia.org/P24524 and previous config saved to /var/cache/conftool/dbconfig/20220412-191310-ladsgroup.json [19:13:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:18] duesen: test now on bast3005 [19:13:30] 10SRE, 10DC-Ops: Confirm support of PERC 750 raid controller - https://phabricator.wikimedia.org/T297913 (10RobH) [19:14:28] jynus: looking good 
[19:14:50] cool, as you know it may take a while to transmit to the other hosts [19:15:03] I think I am going to log off for now [19:16:16] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [19:16:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:16:20] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [19:16:21] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [19:16:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:16:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:16:25] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [19:16:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:02] !log T295666 Gearing up for rolling upgrade of codfw cirrus to `6.8.23`. Commencing operation shortly. Will be using a batch size of 3 hosts [19:18:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:06] T295666: Upgrade Cirrus elasticsearch clusters to 6.8.23 - https://phabricator.wikimedia.org/T295666 [19:18:41] ACKNOWLEDGEMENT - Check systemd state on clouddb1013 is CRITICAL: CRITICAL - degraded: The following units failed: wmf-pt-kill@s1.service,wmf-pt-kill@s3.service andrew bogott These are broken by T305974 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:19:21] (03Merged) 10jenkins-bot: Upstream release v2.4.1 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/779541 (owner: 10Volans) [19:19:27] (03PS1) 10Ottomata: Allow analytics-research-admins to access deploy_airflow key [puppet] - 10https://gerrit.wikimedia.org/r/779542 (https://phabricator.wikimedia.org/T305880) [19:20:07] (03CR) 10Ottomata: [C: 03+2] Allow analytics-research-admins to access deploy_airflow key [puppet] - 10https://gerrit.wikimedia.org/r/779542 
(https://phabricator.wikimedia.org/T305880) (owner: 10Ottomata) [19:21:02] mutante: being able to disable puppet on mwdebug would be useful, yes. [19:21:15] !log bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (3 nodes at a time) for ElasticSearch cluster search_codfw: Upgrading Elasticsearch to 6.8 in CODFW - bking@cumin1001 - T301958 [19:21:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:21:19] T301958: Upgrade Search elasticsearch cluster / codfw to elasticsearch 6.8.23 - https://phabricator.wikimedia.org/T301958 [19:21:31] mutante: but we'd want some kind of mechanism to re-enable it if someone forgets. [19:21:47] Speaking of timeouts.. why is TMOUT five days per default? [19:21:54] duesen: there is icinga alerting if it's disabled for too long but not automatic re-enabling [19:21:58] I mean, you could just not have a timeout, if you set it to five days... [19:22:17] icinga should be sufficient I guess [19:23:26] duesen: you got it (that I will make the ticket for it). yea, not sure if it's visible enough, maybe something smarter for just mwdebug* [19:23:59] some RED in the MOTD banner maybe :p [19:24:56] tput ftw ;) [19:25:27] the thing is.. this used to be an alert right here like "CRIT - puppet not running since 2 days on mwdebug1003" [19:25:44] but then we thought those are way too spammy and summarized them [19:26:16] and now it's one alert for all the hosts where it happens and to notice it you actively have to go look at icinga web UI.. 
which is less likely than seeing this while we talk here [19:26:25] so basically..you can't do it right either way [19:28:15] makes me think the real answer is to treat that differently on hosts where we allow/expect more that they are used for testing [19:29:00] 10SRE, 10Infrastructure-Foundations, 10Mail, 10Patch-For-Review: Exim emitting warnings about tainted filenames - https://phabricator.wikimedia.org/T305962 (10jhathaway) Mailing list discussion, https://www.mail-archive.com/exim-users@exim.org/msg57122.html [19:29:29] then you can argue why do test hosts need monitoring at all..but if you don't then they tend to become 'not like prod' [19:32:22] (03PS1) 10Zabe: Stop setting $wgMultiContentRevisionSchemaMigrationStage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/779545 (https://phabricator.wikimedia.org/T231674) [19:33:10] (03PS2) 10Zabe: Stop setting $wgMultiContentRevisionSchemaMigrationStage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/779545 (https://phabricator.wikimedia.org/T231674) [19:39:23] 10SRE, 10SRE-OnFire (FY2021/2022-Q4), 10Infrastructure-Foundations (FY2021/2022-Q4), 10SRE Observability (FY2021/2022-Q4): Implement an accurate and easy to understand status page for all wikis - https://phabricator.wikimedia.org/T202061 (10Quiddity) [19:39:35] !log uploaded spicerack_2.4.1 to apt.wikimedia.org bullseye-wikimedia [19:39:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:39:53] taavi, urbanecm: one of you around? [19:40:22] What's up zabe ? 
[19:41:33] barely [19:45:42] (03PS1) 10Ryan Kemper: elastic: 2060 is in row D, not C [puppet] - 10https://gerrit.wikimedia.org/r/779547 [19:46:45] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for Nathillard - https://phabricator.wikimedia.org/T305978 (10NHillard-WMF) [19:47:37] (03CR) 10Bking: [C: 03+1] elastic: 2060 is in row D, not C [puppet] - 10https://gerrit.wikimedia.org/r/779547 (owner: 10Ryan Kemper) [19:47:46] 10SRE, 10SRE-Access-Requests: allow certain users to disable puppet on mwdebug hosts - https://phabricator.wikimedia.org/T305979 (10Dzahn) [19:49:10] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic: cp5002 memory errors on DIMM A4 - https://phabricator.wikimedia.org/T305423 (10wiki_willy) Thanks @ssingh. Rob's working on sourcing the replacement DIMM, so we should have that sorted out soon, and will keep you in the loop via an adjacent procurement task. Thank... [19:51:14] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic: cp5002 memory errors on DIMM A4 - https://phabricator.wikimedia.org/T305423 (10RobH) [19:51:38] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for Nathillard - https://phabricator.wikimedia.org/T305978 (10Dzahn) @NHillard-WMF Welcome to WMF. fyi, for normal code review and approvals you should already be set even without this group. I think it's only needed if you want to actually merge code. 
[19:54:25] !log razzi@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host clouddb1013.eqiad.wmnet with OS bullseye [19:54:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:56:26] !log dancy@deploy1002 Finished scap: testwikis wikis to 1.39.0-wmf.7 refs T305213 (duration: 49m 10s) [19:56:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:56:34] T305213: 1.39.0-wmf.7 deployment blockers - https://phabricator.wikimedia.org/T305213 [19:59:06] (03PS1) 10Ahmon Dancy: group0 wikis to 1.39.0-wmf.7 refs T305213 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/779549 [19:59:08] (03CR) 10Ahmon Dancy: [C: 03+2] group0 wikis to 1.39.0-wmf.7 refs T305213 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/779549 (owner: 10Ahmon Dancy) [20:00:02] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for Nathillard - https://phabricator.wikimedia.org/T305978 (10NHillard-WMF) Ah, this is good to know, thanks @Dzahn . And thanks for the welcome as well - nice to meet you! For reference, I am following a guide from my onboarding checklist where it says... [20:00:04] RoanKattouw, Urbanecm, and cjming: #bothumor My software never has bugs. It just develops random features. Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220412T2000). [20:00:04] eigyan and zabe: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:17] (03Merged) 10jenkins-bot: group0 wikis to 1.39.0-wmf.7 refs T305213 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/779549 (owner: 10Ahmon Dancy) [20:01:46] o/ [20:01:48] dancy: fyi B&C window just started :) [20:01:52] I'll wait for the promotion to sync [20:02:00] thx. 
Should be done in a minute [20:02:04] ack [20:02:31] greetings everyone [20:02:34] Greetings [20:03:13] !log dancy@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.39.0-wmf.7 refs T305213 [20:03:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:03:20] T305213: 1.39.0-wmf.7 deployment blockers - https://phabricator.wikimedia.org/T305213 [20:03:28] OK I'm done. [20:04:10] RoanKattouw: and Urbanecm: I volunteered to back u 2 up for this window ongoingly -- happy to deploy whenever tho if I mess up as a relatively unseasoned deployer, one of you may need to bail me out [20:04:22] dancy: so, can i take over? [20:04:45] Yep. all yours. I'll watch logs for a bit. [20:04:48] thanks [20:05:13] cjming: thanks for joining the deployers list! do you want to deploy today ? i can help if sth unexpected happens. [20:05:25] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for Nathillard - https://phabricator.wikimedia.org/T305978 (10Dzahn) @NHillard-WMF I see! Thanks for the clarification. Yes, this makes sense. It's probably more about logins to certain web UIs. Then in code review (gerrit.wikimedia.org) it makes the diff... [20:05:32] sure [20:06:03] in that case, leaving it up to you :) [20:06:15] alrighty then - thanks for being on stand by [20:06:21] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for Nathillard - https://phabricator.wikimedia.org/T305978 (10jcrespo) Hey, @NHillard-WMF, welcome! Do you mind registering your @ wikimedia email as the email -or one of the emails- in your wikitech/LDAP account (and verify it) at https://wikitech.wikime... 
[20:06:22] no problem [20:06:27] (03CR) 10Clare Ming: [C: 03+2] [wmf-config] Undeploy safety survey from PT wiki - PRODUCTION [mediawiki-config] - 10https://gerrit.wikimedia.org/r/779526 (https://phabricator.wikimedia.org/T305855) (owner: 10Eigyan) [20:06:54] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:06:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:06:57] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:06:59] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:06:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:07:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:07:02] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:07:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:07:11] (03Merged) 10jenkins-bot: [wmf-config] Undeploy safety survey from PT wiki - PRODUCTION [mediawiki-config] - 10https://gerrit.wikimedia.org/r/779526 (https://phabricator.wikimedia.org/T305855) (owner: 10Eigyan) [20:07:39] ^^awesome [20:08:07] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for Nathillard - https://phabricator.wikimedia.org/T305978 (10Dzahn) @NHillard-WMF Here's one thing you can do. If you already got to the part where you create your user on https://wikitech.wikimedia.org/wiki/Main_Page then try to use that same user you m... 
[20:08:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1164 (T298565)', diff saved to https://phabricator.wikimedia.org/P24525 and previous config saved to /var/cache/conftool/dbconfig/20220412-200850-ladsgroup.json [20:08:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:54] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [20:09:06] hi eigyan: is it possible to check your config change on mwdebug1001? [20:09:21] will do [20:09:31] oh whoops - hold on a sec [20:09:54] eigyan: ok - now good [20:12:06] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:12:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:12:09] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:12:10] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:12:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:12:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:12:14] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:12:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:13:12] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for Nathillard - https://phabricator.wikimedia.org/T305978 (10NHillard-WMF) @jcrespo Thanks to you as well! >>! In T305978#7849475, @jcrespo wrote: > My guess is you had an existing account with your personal email only, maybe? Yep, this is exactly wha... 
[20:14:08] (Abandoned) Umherirrender: Remove Mailman3 templates files [puppet] - https://gerrit.wikimedia.org/r/755767 (https://phabricator.wikimedia.org/T282308) (owner: Umherirrender) [20:16:02] eigyan: how's it looking? ok to sync? [20:16:35] cjming 2min please [20:16:43] np [20:16:51] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:18:07] cjming looks good! all is validated [20:18:14] cool - syncing now [20:19:23] (PS3) Clare Ming: Stop setting $wgMultiContentRevisionSchemaMigrationStage [mediawiki-config] - https://gerrit.wikimedia.org/r/779545 (https://phabricator.wikimedia.org/T231674) (owner: Zabe) [20:20:35] !log cjming@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:779526|[wmf-config] Undeploy safety survey from PT wiki - PRODUCTION (T305855)]] (duration: 02m 11s) [20:20:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:20:39] T305855: Undeploy safety survey from PT wiki - PRODUCTION - https://phabricator.wikimedia.org/T305855 [20:21:05] hi zabe: you're up next - are you around? 
[20:21:14] yes, hi [20:21:42] !log mforns@deploy1002 Started deploy [airflow-dags/analytics_test@a68eaf2]: (no justification provided) [20:21:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:21:44] eigyan: your change should be live [20:21:50] !log mforns@deploy1002 Finished deploy [airflow-dags/analytics_test@a68eaf2]: (no justification provided) (duration: 00m 07s) [20:21:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:22:14] (CR) Clare Ming: [C: +2] Stop setting $wgMultiContentRevisionSchemaMigrationStage [mediawiki-config] - https://gerrit.wikimedia.org/r/779545 (https://phabricator.wikimedia.org/T231674) (owner: Zabe) [20:22:59] (Merged) jenkins-bot: Stop setting $wgMultiContentRevisionSchemaMigrationStage [mediawiki-config] - https://gerrit.wikimedia.org/r/779545 (https://phabricator.wikimedia.org/T231674) (owner: Zabe) [20:23:01] !log milimetric@deploy1002 Started deploy [airflow-dags/analytics@a68eaf2]: Fixes date format in path to dumps files [20:23:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:09] !log milimetric@deploy1002 Finished deploy [airflow-dags/analytics@a68eaf2]: Fixes date format in path to dumps files (duration: 00m 07s) [20:23:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1164', diff saved to https://phabricator.wikimedia.org/P24527 and previous config saved to /var/cache/conftool/dbconfig/20220412-202355-ladsgroup.json [20:23:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:58] zabe: are you able to check on mwdebug1001? [20:25:35] cjming, nothing seems to explode and logstash is clear, I don't think I can test any further [20:25:48] cjming: can you please let me know once you finish with the B&C patches? 
I need to fix sth in prod too :) [20:25:56] sgtm - I will sync then [20:26:01] urbanecm: sure thing [20:27:20] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:27:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:23] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:27:24] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:27:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:28] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:27:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:52] !log cjming@deploy1002 Synchronized wmf-config: Config: [[gerrit:779545|Stop setting $wgMultiContentRevisionSchemaMigrationStage (T231674)]] (duration: 01m 33s) [20:27:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:56] T231674: [Epic] Remove support for writing to the pre-MCR schema - https://phabricator.wikimedia.org/T231674 [20:28:03] zabe: your update is live [20:28:13] thanks :) [20:29:50] not sure if proper protocol is to hang out a bit more or if scheduled patches are done, that it's ok to close the B&C window? 
[20:30:53] RECOVERY - Maps - OSM synchronization lag - eqiad on alert1001 is OK: (C)2.592e+05 ge (W)1.764e+05 ge 1.747e+05 https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/d/000000305/maps-performances?orgId=1&viewPanel=11 [20:31:53] cjming: probably can close once urbanecm has done what he wants [20:32:03] cjming: generally speaking you should be available "for a while" after B&C [20:32:21] but you can !log the closure immediately (it's there to let people know they can start fiddling with MW prod again) [20:32:33] !log end of UTC late backport & config window [20:32:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:37] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:32:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:40] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:32:41] urbanecm: all yours [20:32:42] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:32:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:43] thanks [20:32:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:46] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:32:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:57] Hey dancy - looks like the sec patch for T226212 got dropped on wmf.6. See the bug at T305982. It's still on deployment and applies fine to CentralAuth. Ok for me to scap that out? [20:34:11] Yes please! [20:34:54] urbanecm: ^ [20:34:54] Ok, will do [20:35:26] whoops, am I bumping heads with another deploy attempt? urbanecm? 
[20:36:30] He said he wanted to know when cjming was done but no idea what he's doing sbassett [20:36:51] Ok, well I just git am'd the patch on wmf.6/extensions/CentralAuth [20:37:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:37:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:37:52] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:37:53] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:37:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:37:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:37:57] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:37:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:38:07] sbassett: i was investigating the very same thing :D [20:38:29] feel free to finish :) [20:38:31] urbanecm: Ok, I'm scapping out the files rn... 
[20:38:37] thx [20:38:56] (PS1) Nray: Enable Table of Contents AB test on Beta [mediawiki-config] - https://gerrit.wikimedia.org/r/779551 (https://phabricator.wikimedia.org/T302046) [20:38:59] !log re-deploy security patch for T226212 to wmf.6 - part 1 [20:39:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1164', diff saved to https://phabricator.wikimedia.org/P24528 and previous config saved to /var/cache/conftool/dbconfig/20220412-203900-ladsgroup.json [20:39:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:39:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:12] !log re-deploy security patch for T226212 to wmf.6 - part 2 [20:41:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:42:17] thx [20:42:30] I hope this also fixed that weird error [20:43:01] (PS2) Nray: Enable Table of Contents AB test on Beta [mediawiki-config] - https://gerrit.wikimedia.org/r/779551 (https://phabricator.wikimedia.org/T302046) [20:54:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1164 (T298565)', diff saved to https://phabricator.wikimedia.org/P24529 and previous config saved to /var/cache/conftool/dbconfig/20220412-205406-ladsgroup.json [20:54:08] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1184.eqiad.wmnet with reason: Maintenance [20:54:09] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1184.eqiad.wmnet with reason: Maintenance [20:54:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:54:11] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 
[20:54:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:54:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:54:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1184 (T298565)', diff saved to https://phabricator.wikimedia.org/P24530 and previous config saved to /var/cache/conftool/dbconfig/20220412-205414-ladsgroup.json [20:54:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:58:23] SRE, serviceops, PHP 7.2 support, PHP 7.3 support, Performance Issue: PHP 7.2 is very slow on an allocation-intensive benchmark - https://phabricator.wikimedia.org/T230861 (Reedy) [20:59:03] (CR) Cwhite: "Thanks for having a look!" [puppet] - https://gerrit.wikimedia.org/r/775375 (https://phabricator.wikimedia.org/T305090) (owner: Cwhite) [21:00:11] (CR) Krinkle: [C: +1] "Good to go. Thanks!" [mediawiki-config] - https://gerrit.wikimedia.org/r/776258 (https://phabricator.wikimedia.org/T45956) (owner: Zabe) [21:04:33] (PS2) Zabe: Stop writing to $wmfUdp2logDest [mediawiki-config] - https://gerrit.wikimedia.org/r/776259 (https://phabricator.wikimedia.org/T45956) [21:07:01] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) 
timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [21:07:33] (CR) Clare Ming: [C: +2] Enable Table of Contents AB test on Beta [mediawiki-config] - https://gerrit.wikimedia.org/r/779551 (https://phabricator.wikimedia.org/T302046) (owner: Nray) [21:08:14] (Merged) jenkins-bot: Enable Table of Contents AB test on Beta [mediawiki-config] - https://gerrit.wikimedia.org/r/779551 (https://phabricator.wikimedia.org/T302046) (owner: Nray) [21:09:13] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [21:10:33] RECOVERY - Check systemd state on clouddb1013 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:13:21] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:13:22] !log razzi@clouddb1013:~$ sudo systemctl reset-failed wmf-pt-kill.service - the wmf-pt-kill@
.service units are running fine [21:13:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:13:24] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:13:25] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:13:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:13:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:13:29] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:13:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:13:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:14:09] urbanecm: quick Q if you're still around - I just merged https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/779551 thinking it's just a beta cluster config change and it will auto-sync to beta -- should I sync this anyway on the deployment + debug servers? if so, can I do that now or at my discretion? [21:14:54] cjming: so long only -labs files are changed, just a pull is sufficient [21:15:08] (syncing it doesn't hurt, but it also doesn't do anything) [21:16:03] !log milimetric@deploy1002 Started deploy [analytics/refinery@34be9f3]: Regular analytics weekly train [analytics/refinery@34be9f3] [21:16:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:17:12] urbanecm: when you say "just a pull", you mean rebase at /srv/mediawiki-staging on the deployment server? [21:17:21] yes [21:17:28] the git fetch git rebase sequence [21:17:39] gtk -- and is it ok to do that now? 
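The rule of thumb urbanecm gives above (a change that touches only -labs files needs just a pull on the deployment host, not a sync) can be sketched as a small check. This is an illustrative sketch only: `needs_sync` is a hypothetical helper, not part of scap or any Wikimedia tool, and it assumes beta-only config files can be recognised by `-labs` in their file name (for example `InitialiseSettings-labs.php`).

```python
from pathlib import PurePosixPath


def needs_sync(changed_files):
    """Return True if any changed file is NOT a beta-only '-labs' file.

    Assumption (not from the log): beta-only files carry '-labs' in
    their file name. Hypothetical helper for illustration only.
    """
    return any(
        "-labs" not in PurePosixPath(path).name
        for path in changed_files
    )


# A change that only touches a -labs file: a pull would be enough.
print(needs_sync(["wmf-config/InitialiseSettings-labs.php"]))

# A change that touches a production config file: a sync is needed.
print(needs_sync(["wmf-config/InitialiseSettings.php"]))
```

When the check returns False, the git fetch and git rebase sequence in /srv/mediawiki-staging mentioned in the chat is, per urbanecm, sufficient.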
[21:18:01] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:18:18] (PS1) Razzi: netboot: set reuse-db.cfg for clouddb10xx hosts [puppet] - https://gerrit.wikimedia.org/r/779557 (https://phabricator.wikimedia.org/T299480) [21:19:40] cjming: yup [21:19:50] cool - thx [21:19:58] it's fine to do at (almost) any time :) [21:21:05] duly noted! [21:32:14] (PS2) RLazarus: sretest: Remove absented external_clouds_vendors [puppet] - https://gerrit.wikimedia.org/r/779146 (https://phabricator.wikimedia.org/T270391) [21:32:22] (CR) RLazarus: [C: +2] sretest: Remove absented external_clouds_vendors [puppet] - https://gerrit.wikimedia.org/r/779146 (https://phabricator.wikimedia.org/T270391) (owner: RLazarus) [21:37:28] !log milimetric@deploy1002 Finished deploy [analytics/refinery@34be9f3]: Regular analytics weekly train [analytics/refinery@34be9f3] (duration: 21m 24s) [21:37:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:46:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T298565)', diff saved to https://phabricator.wikimedia.org/P24531 and previous config saved to /var/cache/conftool/dbconfig/20220412-214642-ladsgroup.json [21:46:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:46:48] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [21:57:41] (PS2) RLazarus: external_cloud_vendors: Add a known-clients/Googlebot ipblock [puppet] - https://gerrit.wikimedia.org/r/779157 (https://phabricator.wikimedia.org/T305581) [21:57:49] (CR) RLazarus: [C: +2] external_cloud_vendors: Add a 
known-clients/Googlebot ipblock [puppet] - https://gerrit.wikimedia.org/r/779157 (https://phabricator.wikimedia.org/T305581) (owner: RLazarus) [21:59:11] !log bking@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.UPGRADE (3 nodes at a time) for ElasticSearch cluster search_codfw: Upgrading Elasticsearch to 6.8 in CODFW - bking@cumin1001 - T301958 [21:59:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:59:17] T301958: Upgrade Search elasticsearch cluster / codfw to elasticsearch 6.8.23 - https://phabricator.wikimedia.org/T301958 [22:01:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P24533 and previous config saved to /var/cache/conftool/dbconfig/20220412-220147-ladsgroup.json [22:01:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:12:05] (CR) Razzi: "I was able to reimage clouddb1013 using this recipe just fine. All the similar clouddb1014-1021 have the same storage scheme." 
[puppet] - https://gerrit.wikimedia.org/r/779557 (https://phabricator.wikimedia.org/T299480) (owner: Razzi) [22:16:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P24534 and previous config saved to /var/cache/conftool/dbconfig/20220412-221652-ladsgroup.json [22:16:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:24:25] (PS3) Volans: service: add new module to expose service::catalog [software/spicerack] - https://gerrit.wikimedia.org/r/775904 [22:24:27] (PS1) Volans: yaml files: fix indentation [software/spicerack] - https://gerrit.wikimedia.org/r/779561 [22:28:45] PROBLEM - SSH on aqs1009.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:31:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T298565)', diff saved to https://phabricator.wikimedia.org/P24535 and previous config saved to /var/cache/conftool/dbconfig/20220412-223158-ladsgroup.json [22:31:59] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1134.eqiad.wmnet with reason: Maintenance [22:32:01] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1134.eqiad.wmnet with reason: Maintenance [22:32:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:32:02] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [22:32:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:32:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:32:06] !log 
ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1134 (T298565)', diff saved to https://phabricator.wikimedia.org/P24536 and previous config saved to /var/cache/conftool/dbconfig/20220412-223206-ladsgroup.json [22:32:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:32:54] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [22:34:22] !log bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (3 nodes at a time) for ElasticSearch cluster search_codfw: Upgrading Elasticsearch to 6.8 in CODFW - bking@cumin1001 - T301958 [22:34:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:34:25] T301958: Upgrade Search elasticsearch cluster / codfw to elasticsearch 6.8.23 - https://phabricator.wikimedia.org/T301958 [22:38:08] (CR) Razzi: "I triple checked this and I'm fine with it :)" [puppet] - https://gerrit.wikimedia.org/r/779557 (https://phabricator.wikimedia.org/T299480) (owner: Razzi) [22:38:13] (CR) Razzi: [C: +2] netboot: set reuse-db.cfg for clouddb10xx hosts [puppet] - https://gerrit.wikimedia.org/r/779557 (https://phabricator.wikimedia.org/T299480) (owner: Razzi) [22:39:06] !log T305646 Re-enabling puppet on `elastic2033`; still need to unban from elasticsearch cluster tomorrow [22:39:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:39:09] T305646: elastic2033 without bootable devices available (repeat of T281621) - https://phabricator.wikimedia.org/T305646 [22:46:23] !log razzi@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on clouddb1014.eqiad.wmnet with reason: Upgrade to bullseye [22:46:25] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:46:25] !log razzi@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1014.eqiad.wmnet with reason: Upgrade to bullseye [22:46:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:48:17] !log razzi@cumin1001 START - Cookbook sre.hosts.reimage for host clouddb1014.eqiad.wmnet with OS bullseye [22:48:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:59:32] !log razzi@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on clouddb1014.eqiad.wmnet with reason: host reimage [22:59:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:03:01] !log razzi@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on clouddb1014.eqiad.wmnet with reason: host reimage [23:03:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:11:42] PROBLEM - haproxy failover on dbproxy1018 is CRITICAL: CRITICAL check_failover servers up 14 down 2: https://wikitech.wikimedia.org/wiki/HAProxy [23:14:44] RECOVERY - haproxy failover on dbproxy1018 is OK: OK check_failover servers up 16 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [23:14:58] (CR) Cwhite: "Looks good to me, thanks!" 
[puppet] - https://gerrit.wikimedia.org/r/779086 (https://phabricator.wikimedia.org/T305652) (owner: Herron) [23:17:57] (CR) Cwhite: [C: +2] "PCC checks out https://puppet-compiler.wmflabs.org/pcc-worker1002/34810/" [puppet] - https://gerrit.wikimedia.org/r/778469 (https://phabricator.wikimedia.org/T304583) (owner: Phedenskog) [23:23:10] !log razzi@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host clouddb1014.eqiad.wmnet with OS bullseye [23:23:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:26:38] (PS1) Razzi: dbproxy: repool all hosts after finishing reimages for day [puppet] - https://gerrit.wikimedia.org/r/779568 (https://phabricator.wikimedia.org/T299480) [23:29:01] RECOVERY - SSH on aqs1009.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:29:22] (CR) Razzi: [C: +2] dbproxy: repool all hosts after finishing reimages for day [puppet] - https://gerrit.wikimedia.org/r/779568 (https://phabricator.wikimedia.org/T299480) (owner: Razzi) [23:32:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T298565)', diff saved to https://phabricator.wikimedia.org/P24537 and previous config saved to /var/cache/conftool/dbconfig/20220412-233248-ladsgroup.json [23:32:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:32:52] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [23:47:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P24538 and previous config saved to 
/var/cache/conftool/dbconfig/20220412-234753-ladsgroup.json [23:47:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:48:41] !log bking@cumin1001 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.UPGRADE (3 nodes at a time) for ElasticSearch cluster search_codfw: Upgrading Elasticsearch to 6.8 in CODFW - bking@cumin1001 - T301958 [23:48:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:48:45] T301958: Upgrade Search elasticsearch cluster / codfw to elasticsearch 6.8.23 - https://phabricator.wikimedia.org/T301958 [23:51:47] SRE, MediaWiki-General, Performance-Team, Platform Engineering Code Jam, and 3 others: Allow easier ICU transitions in MediaWiki (change how sortkey collation is managed in the categorylinks table) - https://phabricator.wikimedia.org/T263437 (tstarling) Open→Resolved I think this was comp...