[00:02:13] RECOVERY - Check systemd state on ml-staging-ctrl2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:02:16] (BlazegraphJvmQuakeWarnGC) firing: Blazegraph instance wdqs1005:9100 is entering a GC death spiral - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphJvmQuakeWarnGC [00:02:45] PROBLEM - Maps tiles generation on alert1001 is CRITICAL: CRITICAL: 100.00% of data under the critical threshold [5.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/d/000000305/maps-performances?orgId=1&viewPanel=8 [00:03:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T298565)', diff saved to https://phabricator.wikimedia.org/P24350 and previous config saved to /var/cache/conftool/dbconfig/20220411-000302-ladsgroup.json [00:03:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:03:08] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [00:04:17] PROBLEM - Widespread puppet agent failures- no resources reported on alert1001 is CRITICAL: 0.01002 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [00:13:01] RECOVERY - Check systemd state on elastic1054 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:18:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P24351 and previous config saved to 
/var/cache/conftool/dbconfig/20220411-001807-ladsgroup.json [00:18:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:19:45] (PS1) BryanDavis: dev: Update Vagrantfile to Debian Bullseye [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/778682 [00:19:47] (PS1) BryanDavis: Add perl532-sssd [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/778683 (https://phabricator.wikimedia.org/T214343) [00:22:01] (CR) BryanDavis: [C: +2] dev: Update Vagrantfile to Debian Bullseye [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/778682 (owner: BryanDavis) [00:23:07] (Merged) jenkins-bot: dev: Update Vagrantfile to Debian Bullseye [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/778682 (owner: BryanDavis) [00:33:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P24352 and previous config saved to /var/cache/conftool/dbconfig/20220411-003312-ladsgroup.json [00:33:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:48:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T298565)', diff saved to https://phabricator.wikimedia.org/P24353 and previous config saved to /var/cache/conftool/dbconfig/20220411-004817-ladsgroup.json [00:48:19] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1164.eqiad.wmnet with reason: Maintenance [00:48:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:48:21] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1164.eqiad.wmnet with reason: Maintenance [00:48:22] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, 
user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [00:48:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:48:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:48:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1164 (T298565)', diff saved to https://phabricator.wikimedia.org/P24354 and previous config saved to /var/cache/conftool/dbconfig/20220411-004826-ladsgroup.json [00:48:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:59:39] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is CRITICAL: 49.59 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [01:00:03] PROBLEM - Varnish traffic drop between 30min ago and now at esams on alert1001 is CRITICAL: 48.43 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [01:01:55] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [01:02:19] RECOVERY - Varnish traffic drop between 30min ago and now at esams on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [01:15:59] RECOVERY - SSH on aqs1009.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:38:45] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:43:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1164 (T298565)', diff saved to https://phabricator.wikimedia.org/P24355 and previous config saved to /var/cache/conftool/dbconfig/20220411-014316-ladsgroup.json [01:43:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:43:22] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [01:43:45] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:58:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1164', diff saved to https://phabricator.wikimedia.org/P24356 and previous config saved to /var/cache/conftool/dbconfig/20220411-015822-ladsgroup.json [01:58:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:11:16] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1004:9193 is burning free allocators at a 
very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [02:13:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1164', diff saved to https://phabricator.wikimedia.org/P24357 and previous config saved to /var/cache/conftool/dbconfig/20220411-021327-ladsgroup.json [02:13:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:19:15] PROBLEM - MariaDB Replica Lag: s3 on db2139 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1359.40 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [02:28:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1164 (T298565)', diff saved to https://phabricator.wikimedia.org/P24358 and previous config saved to /var/cache/conftool/dbconfig/20220411-022832-ladsgroup.json [02:28:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1184.eqiad.wmnet with reason: Maintenance [02:28:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:28:35] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1184.eqiad.wmnet with reason: Maintenance [02:28:37] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [02:28:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:28:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:28:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1184 (T298565)', diff 
saved to https://phabricator.wikimedia.org/P24359 and previous config saved to /var/cache/conftool/dbconfig/20220411-022840-ladsgroup.json [02:28:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:44:27] PROBLEM - SSH on wtp1048.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:11:03] PROBLEM - MariaDB Replica Lag: s3 on db1145 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1312.94 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [03:21:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T298565)', diff saved to https://phabricator.wikimedia.org/P24360 and previous config saved to /var/cache/conftool/dbconfig/20220411-032132-ladsgroup.json [03:21:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:21:38] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [03:36:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P24361 and previous config saved to /var/cache/conftool/dbconfig/20220411-033638-ladsgroup.json [03:36:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:45:35] RECOVERY - SSH on wtp1048.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:51:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P24362 and previous config saved to /var/cache/conftool/dbconfig/20220411-035143-ladsgroup.json 
[03:51:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:02:16] (BlazegraphJvmQuakeWarnGC) firing: Blazegraph instance wdqs1005:9100 is entering a GC death spiral - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphJvmQuakeWarnGC [04:06:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T298565)', diff saved to https://phabricator.wikimedia.org/P24363 and previous config saved to /var/cache/conftool/dbconfig/20220411-040648-ladsgroup.json [04:06:50] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1134.eqiad.wmnet with reason: Maintenance [04:06:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:06:51] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1134.eqiad.wmnet with reason: Maintenance [04:06:52] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [04:06:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:06:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:06:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1134 (T298565)', diff saved to https://phabricator.wikimedia.org/P24364 and previous config saved to /var/cache/conftool/dbconfig/20220411-040656-ladsgroup.json [04:06:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:16:29] (CR) Santhosh: [C: +1] Add SectionTranslation entry points as campaigns [mediawiki-config] - https://gerrit.wikimedia.org/r/778381 
(https://phabricator.wikimedia.org/T298029) (owner: KartikMistry) [04:16:37] RECOVERY - MariaDB Replica Lag: s3 on db2139 is OK: OK slave_sql_lag Replication lag: 0.45 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [04:40:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1164', diff saved to https://phabricator.wikimedia.org/P24365 and previous config saved to /var/cache/conftool/dbconfig/20220411-044058-root.json [04:41:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:42:56] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1127.eqiad.wmnet with reason: Maintenance [04:42:57] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1127.eqiad.wmnet with reason: Maintenance [04:42:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:42:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:43:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1127 (T297189)', diff saved to https://phabricator.wikimedia.org/P24366 and previous config saved to /var/cache/conftool/dbconfig/20220411-044302-marostegui.json [04:43:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:43:06] T297189: Schema change for dropping ft_title and ft_namesapce - https://phabricator.wikimedia.org/T297189 [04:47:57] (PS1) Marostegui: Revert "db1134: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/778634 [04:53:24] (CR) Marostegui: "recheck" [puppet] - https://gerrit.wikimedia.org/r/778634 (owner: Marostegui) [04:55:33] (CR) Marostegui: [C: +2] Revert "db1134: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/778634 (owner: Marostegui) [04:58:14] (PS1) Marostegui: mariadb: Promote db1160 to s4 master [puppet] - https://gerrit.wikimedia.org/r/778688 
(https://phabricator.wikimedia.org/T304933) [04:58:33] (CR) Marostegui: [C: -2] "Wait for the failover date" [puppet] - https://gerrit.wikimedia.org/r/778688 (https://phabricator.wikimedia.org/T304933) (owner: Marostegui) [04:59:36] (PS1) Marostegui: wmnet: Update s4 CNAME [dns] - https://gerrit.wikimedia.org/r/778689 (https://phabricator.wikimedia.org/T304933) [05:00:01] (CR) Marostegui: [C: -2] "Wait for the failover date" [dns] - https://gerrit.wikimedia.org/r/778689 (https://phabricator.wikimedia.org/T304933) (owner: Marostegui) [05:00:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T298565)', diff saved to https://phabricator.wikimedia.org/P24367 and previous config saved to /var/cache/conftool/dbconfig/20220411-050055-ladsgroup.json [05:00:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:00:59] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [05:15:01] RECOVERY - MariaDB Replica Lag: s3 on db1145 is OK: OK slave_sql_lag Replication lag: 0.37 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [05:16:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P24368 and previous config saved to /var/cache/conftool/dbconfig/20220411-051600-ladsgroup.json [05:16:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:20:59] PROBLEM - SSH on aqs1009.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:24:49] SRE, ops-codfw, DBA: codfw: Dedicate Rack B1 for cloudX-dev servers - 
https://phabricator.wikimedia.org/T305469 (Marostegui) [05:26:39] SRE, ops-codfw, DBA: codfw: Dedicate Rack B1 for cloudX-dev servers - https://phabricator.wikimedia.org/T305469 (Marostegui) @Papaul the databases and es2029/es2030 are ready for relocation. Please turn them ON once you are done For what is worth, es2029 and es2030 are scheduled to be done 14th, whic... [05:31:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P24369 and previous config saved to /var/cache/conftool/dbconfig/20220411-053105-ladsgroup.json [05:31:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:35:07] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:43:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1164', diff saved to https://phabricator.wikimedia.org/P24370 and previous config saved to /var/cache/conftool/dbconfig/20220411-054306-root.json [05:43:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:45:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1164', diff saved to https://phabricator.wikimedia.org/P24371 and previous config saved to /var/cache/conftool/dbconfig/20220411-054508-root.json [05:45:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:46:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T298565)', diff saved to https://phabricator.wikimedia.org/P24372 and previous config saved to /var/cache/conftool/dbconfig/20220411-054610-ladsgroup.json [05:46:12] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1135.eqiad.wmnet with reason: Maintenance [05:46:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:46:13] !log 
ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1135.eqiad.wmnet with reason: Maintenance [05:46:14] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [05:46:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:46:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:46:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1135 (T298565)', diff saved to https://phabricator.wikimedia.org/P24373 and previous config saved to /var/cache/conftool/dbconfig/20220411-054618-ladsgroup.json [05:46:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:49:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1164', diff saved to https://phabricator.wikimedia.org/P24374 and previous config saved to /var/cache/conftool/dbconfig/20220411-054902-root.json [05:49:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:50:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T297189)', diff saved to https://phabricator.wikimedia.org/P24375 and previous config saved to /var/cache/conftool/dbconfig/20220411-055037-marostegui.json [05:50:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:50:41] T297189: Schema change for dropping ft_title and ft_namesapce - https://phabricator.wikimedia.org/T297189 [05:56:34] PROBLEM - SSH on wtp1045.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:05:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to 
https://phabricator.wikimedia.org/P24376 and previous config saved to /var/cache/conftool/dbconfig/20220411-060542-marostegui.json [06:05:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:11:16] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1004:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [06:16:37] (CR) Nik Gkountas: [C: +1] Add SectionTranslation entry points as campaigns [mediawiki-config] - https://gerrit.wikimedia.org/r/778381 (https://phabricator.wikimedia.org/T298029) (owner: KartikMistry) [06:20:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P24377 and previous config saved to /var/cache/conftool/dbconfig/20220411-062047-marostegui.json [06:20:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:35:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T297189)', diff saved to https://phabricator.wikimedia.org/P24378 and previous config saved to /var/cache/conftool/dbconfig/20220411-063552-marostegui.json [06:35:54] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1174.eqiad.wmnet with reason: Maintenance [06:35:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:35:56] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1174.eqiad.wmnet with reason: Maintenance [06:35:57] T297189: Schema change for dropping ft_title and ft_namesapce - https://phabricator.wikimedia.org/T297189 [06:35:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[06:36:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:36:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1174 (T297189)', diff saved to https://phabricator.wikimedia.org/P24379 and previous config saved to /var/cache/conftool/dbconfig/20220411-063601-marostegui.json [06:36:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:40:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T298565)', diff saved to https://phabricator.wikimedia.org/P24380 and previous config saved to /var/cache/conftool/dbconfig/20220411-064033-ladsgroup.json [06:40:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:40:37] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [06:55:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P24381 and previous config saved to /var/cache/conftool/dbconfig/20220411-065538-ladsgroup.json [06:55:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:57:48] RECOVERY - SSH on wtp1045.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:10:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P24382 and previous config saved to /var/cache/conftool/dbconfig/20220411-071043-ladsgroup.json [07:10:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:25:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T298565)', diff saved to 
https://phabricator.wikimedia.org/P24383 and previous config saved to /var/cache/conftool/dbconfig/20220411-072548-ladsgroup.json [07:25:50] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1119.eqiad.wmnet with reason: Maintenance [07:25:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:25:51] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1119.eqiad.wmnet with reason: Maintenance [07:25:52] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [07:25:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:25:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:25:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1119 (T298565)', diff saved to https://phabricator.wikimedia.org/P24384 and previous config saved to /var/cache/conftool/dbconfig/20220411-072556-ladsgroup.json [07:25:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:35:21] !log restarting blazegraph on wdqs1004 (BlazegraphFreeAllocatorsDecreasingRapidly fired over the week-end) [07:35:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:36:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T297189)', diff saved to https://phabricator.wikimedia.org/P24385 and previous config saved to /var/cache/conftool/dbconfig/20220411-073615-marostegui.json [07:36:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:36:19] T297189: Schema change for dropping ft_title and ft_namesapce - https://phabricator.wikimedia.org/T297189 [07:45:29] SRE, ops-codfw, DBA: 
codfw: Dedicate Rack B1 for cloudX-dev servers - https://phabricator.wikimedia.org/T305469 (MoritzMuehlenhoff) [07:46:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1004:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [07:51:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P24386 and previous config saved to /var/cache/conftool/dbconfig/20220411-075120-marostegui.json [07:51:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:52:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1163', diff saved to https://phabricator.wikimedia.org/P24387 and previous config saved to /var/cache/conftool/dbconfig/20220411-075214-root.json [07:52:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:57:51] !log aqu@deploy1002 Started deploy [airflow-dags/analytics@63cbb55]: T302876_migrate_mediarequest_to_airflow [airflow-dags/analytics@63cbb55] [07:57:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:00:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1163', diff saved to https://phabricator.wikimedia.org/P24388 and previous config saved to /var/cache/conftool/dbconfig/20220411-080047-root.json [08:00:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:02:13] !log aqu@deploy1002 Finished deploy [airflow-dags/analytics@63cbb55]: T302876_migrate_mediarequest_to_airflow [airflow-dags/analytics@63cbb55] (duration: 04m 21s) [08:02:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:02:16] (BlazegraphJvmQuakeWarnGC) 
firing: Blazegraph instance wdqs1005:9100 is entering a GC death spiral - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphJvmQuakeWarnGC [08:03:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1163', diff saved to https://phabricator.wikimedia.org/P24389 and previous config saved to /var/cache/conftool/dbconfig/20220411-080344-root.json [08:03:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:04:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1135', diff saved to https://phabricator.wikimedia.org/P24390 and previous config saved to /var/cache/conftool/dbconfig/20220411-080402-root.json [08:04:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:06:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P24391 and previous config saved to /var/cache/conftool/dbconfig/20220411-080625-marostegui.json [08:06:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:19:00] PROBLEM - Check systemd state on dumpsdata1002 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_rasdaemon.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:21:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T297189)', diff saved to https://phabricator.wikimedia.org/P24392 and previous config saved to /var/cache/conftool/dbconfig/20220411-082130-marostegui.json [08:21:32] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1171.eqiad.wmnet with reason: Maintenance [08:21:33] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1171.eqiad.wmnet with reason: Maintenance [08:21:34] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:21:36] T297189: Schema change for dropping ft_title and ft_namesapce - https://phabricator.wikimedia.org/T297189 [08:21:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:21:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:22:42] Can anyone update Deployments page on Wikitech? I'm not sure how to do it. [08:22:56] !log aqu@deploy1002 Started deploy [airflow-dags/analytics_test@a337e34]: T302876_migrate_mediarequest_to_airflow [airflow-dags/analytics_test@a337e34] [08:22:57] ie https://wikitech.wikimedia.org/wiki/Deployments lacking this and next week's schedule. [08:22:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:23:04] !log aqu@deploy1002 Finished deploy [airflow-dags/analytics_test@a337e34]: T302876_migrate_mediarequest_to_airflow [airflow-dags/analytics_test@a337e34] (duration: 00m 08s) [08:23:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:23:18] !log aqu@deploy1002 Started deploy [airflow-dags/analytics@a337e34]: T302876_migrate_mediarequest_to_airflow [airflow-dags/analytics@a337e34] [08:23:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:23:26] !log aqu@deploy1002 Finished deploy [airflow-dags/analytics@a337e34]: T302876_migrate_mediarequest_to_airflow [airflow-dags/analytics@a337e34] (duration: 00m 07s) [08:23:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:24:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119 (T298565)', diff saved to https://phabricator.wikimedia.org/P24393 and previous config saved to /var/cache/conftool/dbconfig/20220411-082456-ladsgroup.json [08:24:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:25:00] T298565: Fix mismatching field type of user table for columns user_email_authenticated, 
user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [08:27:01] (BlazegraphJvmQuakeWarnGC) resolved: Blazegraph instance wdqs1005:9100 is entering a GC death spiral - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphJvmQuakeWarnGC [08:29:18] 10SRE, 10Prod-Kubernetes, 10Traffic, 10serviceops, 10Kubernetes: service:.catalog entries and dnsdisc for Kubernetes services under Ingress - https://phabricator.wikimedia.org/T305358 (10JMeybohm) [08:34:52] jouncebot: nowandnext [08:34:52] No deployments scheduled for the forseeable future! [08:34:52] No deployments scheduled for the forseeable future! [08:35:33] Lucas_WMDE: https://wikitech.wikimedia.org/wiki/Deployments seems not updated :) [08:35:37] yes [08:38:48] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:40:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119', diff saved to https://phabricator.wikimedia.org/P24394 and previous config saved to /var/cache/conftool/dbconfig/20220411-084001-ladsgroup.json [08:40:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:01] (BlazegraphJvmQuakeWarnGC) firing: Blazegraph instance wdqs1005:9100 is entering a GC death spiral - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphJvmQuakeWarnGC [08:55:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119', diff saved to https://phabricator.wikimedia.org/P24395 and previous config saved to 
/var/cache/conftool/dbconfig/20220411-085506-ladsgroup.json [08:55:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:57] * Lucas_WMDE experimenting on mwdebug1001 [08:59:33] (03PS1) 10KartikMistry: Update cxserver to 2022-04-11-085026-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/778988 (https://phabricator.wikimedia.org/T305125) [08:59:45] !log jynus@cumin2002 START - Cookbook sre.hosts.reimage for host backup2007.codfw.wmnet with OS bullseye [08:59:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:01:30] * Lucas_WMDE done [09:07:57] * kart_ updating cxserver.. [09:10:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119 (T298565)', diff saved to https://phabricator.wikimedia.org/P24396 and previous config saved to /var/cache/conftool/dbconfig/20220411-091011-ladsgroup.json [09:10:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:17] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2103.codfw.wmnet with reason: Maintenance [09:10:18] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [09:10:19] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2103.codfw.wmnet with reason: Maintenance [09:10:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:20] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 14 hosts with reason: Maintenance [09:10:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:30] !log ladsgroup@cumin1001 
END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 14 hosts with reason: Maintenance [09:10:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1135', diff saved to https://phabricator.wikimedia.org/P24397 and previous config saved to /var/cache/conftool/dbconfig/20220411-091103-root.json [09:11:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:51] (03CR) 10KartikMistry: [C: 03+2] Update cxserver to 2022-04-11-085026-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/778988 (https://phabricator.wikimedia.org/T305125) (owner: 10KartikMistry) [09:13:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1135', diff saved to https://phabricator.wikimedia.org/P24398 and previous config saved to /var/cache/conftool/dbconfig/20220411-091319-root.json [09:13:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:01] (BlazegraphJvmQuakeWarnGC) resolved: Blazegraph instance wdqs1005:9100 is entering a GC death spiral - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphJvmQuakeWarnGC [09:14:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1135', diff saved to https://phabricator.wikimedia.org/P24399 and previous config saved to /var/cache/conftool/dbconfig/20220411-091455-root.json [09:14:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1106', diff saved to https://phabricator.wikimedia.org/P24400 and previous config saved to /var/cache/conftool/dbconfig/20220411-091512-root.json [09:15:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:05] (03Merged) 10jenkins-bot: 
Update cxserver to 2022-04-11-085026-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/778988 (https://phabricator.wikimedia.org/T305125) (owner: 10KartikMistry) [09:17:28] !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/cxserver: apply [09:17:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:17:59] !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [09:18:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:19:50] !log jynus@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on backup2007.codfw.wmnet with reason: host reimage [09:19:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:22:24] !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/cxserver: apply [09:22:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:23:17] !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply [09:23:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:23:49] (03PS1) 10MMandere: cache::varnish: Merge repeating host data to site data [puppet] - 10https://gerrit.wikimedia.org/r/778989 (https://phabricator.wikimedia.org/T290005) [09:24:29] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2121.codfw.wmnet with reason: Maintenance [09:24:31] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2121.codfw.wmnet with reason: Maintenance [09:24:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:32] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 16:00:00 on 9 hosts with reason: Maintenance [09:24:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:39] !log marostegui@cumin1001 END 
(PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on 9 hosts with reason: Maintenance [09:24:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:05] !log jynus@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on backup2007.codfw.wmnet with reason: host reimage [09:25:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:50] !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/cxserver: apply [09:25:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:55] (03CR) 10Volans: [C: 03+2] "PCC confirms noop https://puppet-compiler.wmflabs.org/pcc-worker1001/34763/" [puppet] - 10https://gerrit.wikimedia.org/r/778331 (owner: 10Volans) [09:26:45] !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply [09:26:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:09] !log Updated cxserver to 2022-04-11-085026-production (T305125) [09:28:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:13] T305125: Enable Flores as the default service for Icelandic, Igbo and Zulu - https://phabricator.wikimedia.org/T305125 [09:28:22] (03PS6) 10Volans: spicerack: install service::catalog configuration [puppet] - 10https://gerrit.wikimedia.org/r/778333 [09:29:42] (03CR) 10MMandere: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34765/console" [puppet] - 10https://gerrit.wikimedia.org/r/778989 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere) [09:30:11] (03CR) 10Volans: [C: 03+2] "PCC happy https://puppet-compiler.wmflabs.org/pcc-worker1001/34764/cumin1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/778333 (owner: 10Volans) [09:36:54] (03CR) 10Ladsgroup: [C: 03+1] wmnet: Update s4 CNAME [dns] - 10https://gerrit.wikimedia.org/r/778689 
(https://phabricator.wikimedia.org/T304933) (owner: 10Marostegui) [09:37:12] (03CR) 10Ladsgroup: [C: 03+1] mariadb: Promote db1160 to s4 master [puppet] - 10https://gerrit.wikimedia.org/r/778688 (https://phabricator.wikimedia.org/T304933) (owner: 10Marostegui) [09:38:52] jouncebot: nowandnext [09:38:52] No deployments scheduled for the forseeable future! [09:38:52] No deployments scheduled for the forseeable future! [09:39:01] interesting [09:39:34] https://wikitech.wikimedia.org/wiki/Deployments has not been updated for this week yet [09:39:35] !log jynus@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host backup2007.codfw.wmnet with OS bullseye [09:39:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:41] thcipriani: can we haz new calendar? [09:39:54] it can be because of Easter? [09:40:22] https://wikitech.wikimedia.org/wiki/Deployments/Yearly_calendar only has Friday as a no-deploy day, not the whole week [09:40:44] noted [09:41:04] (03CR) 10Ladsgroup: [C: 03+2] Older browser do not return a promise from .play() [extensions/TimedMediaHandler] (wmf/1.39.0-wmf.6) - 10https://gerrit.wikimedia.org/r/778238 (https://phabricator.wikimedia.org/T304705) (owner: 10TheDJ) [09:41:47] (03PS2) 10Ladsgroup: Enable videojs on wiktionary wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/778197 (https://phabricator.wikimedia.org/T248418) [09:41:51] (03CR) 10Ladsgroup: [C: 03+2] Enable videojs on wiktionary wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/778197 (https://phabricator.wikimedia.org/T248418) (owner: 10Ladsgroup) [09:42:36] (03Merged) 10jenkins-bot: Enable videojs on wiktionary wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/778197 (https://phabricator.wikimedia.org/T248418) (owner: 10Ladsgroup) [09:44:26] !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:778197|Enable videojs on wiktionary wikis (T248418)]] (duration: 00m 52s) 
[09:44:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:44:30] T248418: Roll out videojs as the only video/audio player on all Wikimedia wikis - https://phabricator.wikimedia.org/T248418 [09:46:33] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [09:46:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:36] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [09:46:37] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [09:46:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:41] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [09:46:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:26] (03Merged) 10jenkins-bot: Older browser do not return a promise from .play() [extensions/TimedMediaHandler] (wmf/1.39.0-wmf.6) - 10https://gerrit.wikimedia.org/r/778238 (https://phabricator.wikimedia.org/T304705) (owner: 10TheDJ) [09:57:38] 10SRE, 10ops-codfw, 10DBA: codfw: Dedicate Rack B1 for cloudX-dev servers - https://phabricator.wikimedia.org/T305469 (10Kormat) >>! In T305469#7843940, @Marostegui wrote: > For what is worth, es2029 and es2030 are scheduled to be done 14th, which is a bank holiday for me, so someone else would need to bring... 
[09:58:19] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1099.eqiad.wmnet with reason: Maintenance [09:58:21] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1099.eqiad.wmnet with reason: Maintenance [09:58:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:58:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:58:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1099:3311 (T298565)', diff saved to https://phabricator.wikimedia.org/P24401 and previous config saved to /var/cache/conftool/dbconfig/20220411-095826-ladsgroup.json [09:58:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:58:29] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [09:58:35] 10SRE, 10ops-codfw, 10DBA: codfw: Dedicate Rack B1 for cloudX-dev servers - https://phabricator.wikimedia.org/T305469 (10Marostegui) Thanks @Kormat [09:58:59] !log ladsgroup@deploy1002 Synchronized php-1.39.0-wmf.6/extensions/TimedMediaHandler/resources/ext.tmh.player.element.js: Backport: [[gerrit:778238|Older browser do not return a promise from .play() (T304705)]] (duration: 00m 52s) [09:59:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:02] T304705: videojs TypeError: Cannot read property 'then' of undefined - https://phabricator.wikimedia.org/T304705 [10:01:21] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [10:01:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:47] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [10:01:49] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:50] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [10:01:51] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [10:01:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:55] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [10:01:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:16] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:06:58] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [10:06:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:07:01] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [10:07:02] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [10:07:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:07:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:07:06] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [10:07:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:09:21] 10SRE, 10conftool: Provide a meaningful Retry-After value - https://phabricator.wikimedia.org/T305824 (10Vgutierrez) [10:10:44] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [10:10:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:11:11] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [10:11:12] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:25:38] PROBLEM - restbase endpoints health on restbase2019 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:27:52] RECOVERY - restbase endpoints health on restbase2019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:30:18] (03CR) 10MVernon: [C: 04-1] "Hi," [puppet] - 10https://gerrit.wikimedia.org/r/773298 (https://phabricator.wikimedia.org/T269108) (owner: 10Jcrespo) [10:33:25] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1158.eqiad.wmnet with reason: Maintenance [10:33:27] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1158.eqiad.wmnet with reason: Maintenance [10:33:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:28] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 16:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [10:33:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:32] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [10:33:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1158 (T297189)', diff saved to https://phabricator.wikimedia.org/P24402 and previous config saved to /var/cache/conftool/dbconfig/20220411-103336-marostegui.json [10:33:39] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:40] T297189: Schema change for dropping ft_title and ft_namespace - https://phabricator.wikimedia.org/T297189 [10:34:54] (03CR) 10Filippo Giunchedi: prometheus: enable prometheus web access via proxy with IDP (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/764895 (https://phabricator.wikimedia.org/T301944) (owner: 10Herron) [10:37:18] (03CR) 10Filippo Giunchedi: [C: 04-1] "Generally LGTM, see inline" [puppet] - 10https://gerrit.wikimedia.org/r/764895 (https://phabricator.wikimedia.org/T301944) (owner: 10Herron) [10:37:36] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 11 hosts with reason: Rebooting primary T303174 [10:37:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:44] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 11 hosts with reason: Rebooting primary T303174 [10:37:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:15] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db2121.codfw.wmnet with reason: Rebooting for T303174 [10:38:17] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db2121.codfw.wmnet with reason: Rebooting for T303174 [10:38:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:50] (03CR) 10Filippo Giunchedi: sre.kafka.reboot-workers: remove systemctl stop calls (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/778517 (https://phabricator.wikimedia.org/T305652) (owner: 10Herron) [10:39:11] (03PS3) 10Jcrespo: swift: Create a new read-only role on mw account for backup taking [puppet] - 10https://gerrit.wikimedia.org/r/773298 (https://phabricator.wikimedia.org/T269108) [10:39:41] (03CR) 10Filippo Giunchedi: sre.kafka.reboot-workers: add 
--skip-mirrormaker option (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/778325 (https://phabricator.wikimedia.org/T305652) (owner: 10Herron) [10:39:49] (03PS4) 10Jcrespo: swift: Create a new read-only role on mw account for backup taking [puppet] - 10https://gerrit.wikimedia.org/r/773298 (https://phabricator.wikimedia.org/T269108) [10:39:51] 10ops-eqiad: Port with no description on access switch - https://phabricator.wikimedia.org/T304849 (10phaultfinder) [10:40:29] (03CR) 10Jcrespo: "Done." [puppet] - 10https://gerrit.wikimedia.org/r/773298 (https://phabricator.wikimedia.org/T269108) (owner: 10Jcrespo) [10:41:37] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db2121.codfw.wmnet with reason: Rebooting for T303174 [10:41:39] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db2121.codfw.wmnet with reason: Rebooting for T303174 [10:41:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:06] (03CR) 10MVernon: [C: 03+1] "LGTM, thanks." [puppet] - 10https://gerrit.wikimedia.org/r/773298 (https://phabricator.wikimedia.org/T269108) (owner: 10Jcrespo) [10:43:47] (03PS2) 10MMandere: cache::varnish: Merge repeating host data to common data [puppet] - 10https://gerrit.wikimedia.org/r/778989 (https://phabricator.wikimedia.org/T290005) [10:44:33] (03CR) 10Filippo Giunchedi: "LGTM overall, see inline" [puppet] - 10https://gerrit.wikimedia.org/r/778485 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [10:48:14] RECOVERY - SSH on aqs1009.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:53:23] (03CR) 10MVernon: "One nit inline, and I agree with Filippo's comments, but otherwise this looks good to me, thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/778485 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [10:55:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311 (T298565)', diff saved to https://phabricator.wikimedia.org/P24403 and previous config saved to /var/cache/conftool/dbconfig/20220411-105525-ladsgroup.json [10:55:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:30] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [11:00:39] (03PS7) 10Zabe: swift: migrate stats_account cron to systemd timer job [puppet] - 10https://gerrit.wikimedia.org/r/778485 (https://phabricator.wikimedia.org/T273673) [11:01:13] (03CR) 10Zabe: swift: migrate stats_account cron to systemd timer job (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/778485 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [11:02:28] PROBLEM - SSH on mw2258.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:02:58] (03CR) 10MVernon: "LGTM, thanks." 
[puppet] - 10https://gerrit.wikimedia.org/r/778485 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [11:04:32] (03CR) 10Klausman: [C: 03+2] ml-services: add plwiki, ptwiki & rowiki editquality isvcs [deployment-charts] - 10https://gerrit.wikimedia.org/r/778251 (https://phabricator.wikimedia.org/T301415) (owner: 10Kevin Bazira) [11:05:40] (03CR) 10MMandere: [V: 03+1] "PCC SUCCESS (NOOP 12): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34766/console" [puppet] - 10https://gerrit.wikimedia.org/r/778989 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere) [11:06:06] (03CR) 10Vgutierrez: [C: 03+1] cache::varnish: Merge repeating host data to common data [puppet] - 10https://gerrit.wikimedia.org/r/778989 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere) [11:08:54] (03PS5) 10Btullis: Configure LDAP authentication for DataHub [deployment-charts] - 10https://gerrit.wikimedia.org/r/778345 (https://phabricator.wikimedia.org/T301462) [11:08:57] (03CR) 10MMandere: [V: 03+1 C: 03+2] cache::varnish: Merge repeating host data to common data [puppet] - 10https://gerrit.wikimedia.org/r/778989 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere) [11:10:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', diff saved to https://phabricator.wikimedia.org/P24404 and previous config saved to /var/cache/conftool/dbconfig/20220411-111030-ladsgroup.json [11:10:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:06] PROBLEM - Disk space on ml-staging-ctrl2002 is CRITICAL: DISK CRITICAL - free space: / 1129 MB (5% inode=95%): /tmp 1129 MB (5% inode=95%): /var/tmp 1129 MB (5% inode=95%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ml-staging-ctrl2002&var-datasource=codfw+prometheus/ops [11:14:44] 10SRE, 10SRE-Access-Requests: Requesting access to LDAP group NDA for 
TomekSikora.Monsoon - https://phabricator.wikimedia.org/T304502 (10jcrespo) @soworu: Apologies for the confusion- the procedure for requesting access to the Google Search Console has recently changed (2 weeks ago), as it is being now oversee... [11:18:03] 10SRE, 10SRE-Access-Requests: Requesting access to google console for TomekSikora.Monsoon - https://phabricator.wikimedia.org/T304502 (10jcrespo) [11:18:10] !log btullis@cumin1001 START - Cookbook sre.hadoop.reboot-workers for Hadoop analytics cluster [11:18:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:48] (03CR) 10Btullis: Configure LDAP authentication for DataHub (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/778345 (https://phabricator.wikimedia.org/T301462) (owner: 10Btullis) [11:22:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1106', diff saved to https://phabricator.wikimedia.org/P24405 and previous config saved to /var/cache/conftool/dbconfig/20220411-112229-root.json [11:22:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1106', diff saved to https://phabricator.wikimedia.org/P24406 and previous config saved to /var/cache/conftool/dbconfig/20220411-112452-root.json [11:24:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', diff saved to https://phabricator.wikimedia.org/P24407 and previous config saved to /var/cache/conftool/dbconfig/20220411-112536-ladsgroup.json [11:25:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:42] 10SRE, 10SRE-Access-Requests: Requesting access to google console for TomekSikora.Monsoon - https://phabricator.wikimedia.org/T304502 (10jcrespo) [11:27:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1106', diff 
saved to https://phabricator.wikimedia.org/P24408 and previous config saved to /var/cache/conftool/dbconfig/20220411-112741-root.json [11:27:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1119', diff saved to https://phabricator.wikimedia.org/P24409 and previous config saved to /var/cache/conftool/dbconfig/20220411-112825-root.json [11:28:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:32] (03CR) 10JMeybohm: "Apart from having just one LDAP server, this LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/778345 (https://phabricator.wikimedia.org/T301462) (owner: 10Btullis) [11:32:14] (03CR) 10Awight: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/779014 (owner: 10Awight) [11:32:26] (03CR) 10Jcrespo: "Any preference in key generation (method/length) on the private server? I use openssl usually." [puppet] - 10https://gerrit.wikimedia.org/r/773298 (https://phabricator.wikimedia.org/T269108) (owner: 10Jcrespo) [11:32:33] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] Add perl532-sssd [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/778683 (https://phabricator.wikimedia.org/T214343) (owner: 10BryanDavis) [11:32:48] (03CR) 10Awight: Remove configuration which is the same as the extension's default (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/779014 (owner: 10Awight) [11:34:56] !log Adjust loopback filter on cr3-ulsfo to align with L3 switch config. T304553. 
[11:34:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:00] T304553: Unify loopback filters between CR routers and L3 switches - https://phabricator.wikimedia.org/T304553 [11:36:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T297189)', diff saved to https://phabricator.wikimedia.org/P24410 and previous config saved to /var/cache/conftool/dbconfig/20220411-113657-marostegui.json [11:37:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:01] T297189: Schema change for dropping ft_title and ft_namespace - https://phabricator.wikimedia.org/T297189 [11:40:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311 (T298565)', diff saved to https://phabricator.wikimedia.org/P24411 and previous config saved to /var/cache/conftool/dbconfig/20220411-114041-ladsgroup.json [11:40:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:46] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [11:40:46] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1099.eqiad.wmnet with reason: Maintenance [11:40:48] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1099.eqiad.wmnet with reason: Maintenance [11:40:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1099:3311 (T298565)', diff saved to https://phabricator.wikimedia.org/P24412 and previous config saved to 
/var/cache/conftool/dbconfig/20220411-114053-ladsgroup.json [11:40:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:48] !log Adjust loopback filter on asw1-b12-drmrs to align with CR router config. T304553. [11:46:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:52] T304553: Unify loopback filters between CR routers and L3 switches - https://phabricator.wikimedia.org/T304553 [11:48:04] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] hieradata: make grafana-cloud the preferred hostname [puppet] - 10https://gerrit.wikimedia.org/r/778674 (owner: 10Majavah) [11:49:27] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "Please collect +1 from Vivian." [puppet] - 10https://gerrit.wikimedia.org/r/778673 (https://phabricator.wikimedia.org/T304716) (owner: 10Majavah) [11:52:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P24413 and previous config saved to /var/cache/conftool/dbconfig/20220411-115202-marostegui.json [11:52:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:23] (03CR) 10Btullis: Configure LDAP authentication for DataHub (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/778345 (https://phabricator.wikimedia.org/T301462) (owner: 10Btullis) [11:56:31] !log kevinbazira@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [11:56:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:00] !log kevinbazira@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
[12:02:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:24] 10SRE, 10SRE-Access-Requests: Requesting access to google console for TomekSikora.Monsoon - https://phabricator.wikimedia.org/T304502 (10SCherukuwada) [12:03:29] (03PS1) 10Zabe: snapshot: migrate adds-changes cron to systemd timer job [puppet] - 10https://gerrit.wikimedia.org/r/779016 (https://phabricator.wikimedia.org/T273673) [12:03:31] (03PS1) 10Zabe: snapshot: remove absented add-changes cron [puppet] - 10https://gerrit.wikimedia.org/r/779017 (https://phabricator.wikimedia.org/T273673) [12:04:50] 10SRE, 10SRE-Access-Requests: Requesting access to google console for TomekSikora.Monsoon - https://phabricator.wikimedia.org/T304502 (10SCherukuwada) [12:07:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P24414 and previous config saved to /var/cache/conftool/dbconfig/20220411-120707-marostegui.json [12:07:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:43] (03CR) 10Zabe: [V: 03+1] "PCC: https://puppet-compiler.wmflabs.org/pcc-worker1003/34768/" [puppet] - 10https://gerrit.wikimedia.org/r/779016 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [12:15:44] PROBLEM - Disk space on ml-staging-ctrl2002 is CRITICAL: DISK CRITICAL - free space: / 1074 MB (5% inode=95%): /tmp 1074 MB (5% inode=95%): /var/tmp 1074 MB (5% inode=95%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ml-staging-ctrl2002&var-datasource=codfw+prometheus/ops [12:18:17] 10SRE, 10SRE-Access-Requests: Requesting access to google console for TomekSikora.Monsoon - https://phabricator.wikimedia.org/T304502 (10SCherukuwada) @jcrespo I've been involved in this discussion so I know what's going on here. I've updated the ticket to reflect what they need. I can take care of providing... 
[12:21:55] 10SRE, 10SRE-Access-Requests: Requesting access to google console for TomekSikora.Monsoon - https://phabricator.wikimedia.org/T304502 (10RhinosF1) @KFrancis normally confirms NDAs [12:22:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T297189)', diff saved to https://phabricator.wikimedia.org/P24415 and previous config saved to /var/cache/conftool/dbconfig/20220411-122212-marostegui.json [12:22:14] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1098.eqiad.wmnet with reason: Maintenance [12:22:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:16] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1098.eqiad.wmnet with reason: Maintenance [12:22:16] T297189: Schema change for dropping ft_title and ft_namespace - https://phabricator.wikimedia.org/T297189 [12:22:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:19] (03CR) 10ArielGlenn: [C: 03+1] "Looks equivalent, thanks for picking this work back up." [puppet] - 10https://gerrit.wikimedia.org/r/779016 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [12:22:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3317 (T297189)', diff saved to https://phabricator.wikimedia.org/P24416 and previous config saved to /var/cache/conftool/dbconfig/20220411-122220-marostegui.json [12:22:23] 10SRE, 10SRE-Access-Requests: Requesting access to google console for TomekSikora.Monsoon - https://phabricator.wikimedia.org/T304502 (10jcrespo) >>! In T304502#7844616, @SCherukuwada wrote: > @jcrespo I've been involved in this discussion so I know what's going on here. I've updated the ticket to reflect wha... 
[12:22:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:22] (03CR) 10Zabe: "{{ping}} slight reminder that this still needs deployment :)" [puppet] - 10https://gerrit.wikimedia.org/r/751207 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [12:25:25] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 6 hosts with reason: Rebooting x2 codfw primary T303174 [12:25:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:29] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 6 hosts with reason: Rebooting x2 codfw primary T303174 [12:25:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:57] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db2142.codfw.wmnet with reason: Rebooting for T303174 [12:25:58] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db2142.codfw.wmnet with reason: Rebooting for T303174 [12:25:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:31:25] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db1151.eqiad.wmnet with reason: Rebooting for T303174 [12:31:27] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db1151.eqiad.wmnet with reason: Rebooting for T303174 [12:31:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:31:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:31:33] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for drochford - https://phabricator.wikimedia.org/T305634 (10jcrespo) a:03jcrespo [12:32:26] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:34:08] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for drochford (superset access with no server access) - https://phabricator.wikimedia.org/T305634 (10jcrespo) [12:35:37] (03PS1) 10Zabe: graphite: migrate update_graphite_index cron to systemd timer job [puppet] - 10https://gerrit.wikimedia.org/r/779022 (https://phabricator.wikimedia.org/T273673) [12:35:39] (03PS1) 10Zabe: graphite: remove absented update_graphite_index cron [puppet] - 10https://gerrit.wikimedia.org/r/779023 (https://phabricator.wikimedia.org/T273673) [12:36:29] !log About to deploy analytics/refinery "Migrate mediarequest hourly from Oozie to Airflow" [12:36:30] ^^^ this BGP alert was due to BFD session failing towards doh1001. Restored without intervention about a minute later. [12:36:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:39] !log aqu@deploy1002 Started deploy [analytics/refinery@f0a1656]: Migrate mediarequest hourly from Oozie to Airflow [analytics/refinery@f0a1656] [12:37:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311 (T298565)', diff saved to https://phabricator.wikimedia.org/P24417 and previous config saved to /var/cache/conftool/dbconfig/20220411-123906-ladsgroup.json [12:39:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:10] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [12:39:58] (03CR) 10Zabe: [V: 03+1] "PCC: https://puppet-compiler.wmflabs.org/pcc-worker1003/34769/graphite2003.codfw.wmnet/index.html" [puppet] - 
10https://gerrit.wikimedia.org/r/779022 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [12:44:57] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for drochford (superset access with no server access) - https://phabricator.wikimedia.org/T305634 (10jcrespo) Hey, @drochford, While I check and process your access request, would you mind linking your Wikitech/LDAP account on your... [12:47:49] !log aqu@deploy1002 Started deploy [airflow-dags/analytics@cae0024]: T302876_migrate_mediarequest_to_airflow [airflow-dags/analytics@cae0024] [12:47:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:48:21] !log aqu@deploy1002 Finished deploy [airflow-dags/analytics@cae0024]: T302876_migrate_mediarequest_to_airflow [airflow-dags/analytics@cae0024] (duration: 00m 32s) [12:48:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:52] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for drochford (superset access with no server access) - https://phabricator.wikimedia.org/T305634 (10jcrespo) [12:50:11] (03CR) 10Filippo Giunchedi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/778485 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [12:54:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', diff saved to https://phabricator.wikimedia.org/P24418 and previous config saved to /var/cache/conftool/dbconfig/20220411-125411-ladsgroup.json [12:54:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:32] 10SRE, 10observability, 10Patch-For-Review, 10SRE Observability (FY2021/2022-Q3), 10Sustainability (Incident Followup): Most Icinga http checks ignore the URL parameter - https://phabricator.wikimedia.org/T304321 (10fgiunchedi) Apologies for the delay, python implementation looks good to me and I agree p... 
[12:54:34] topranks: thanks, that's interesting though. I will check! [12:55:00] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, though given how critical (hah) this plugin is I'd recommend even basic unit/integration tests" [puppet] - 10https://gerrit.wikimedia.org/r/773272 (https://phabricator.wikimedia.org/T304321) (owner: 10Jbond) [12:56:09] 10SRE, 10observability, 10Patch-For-Review, 10SRE Observability (FY2021/2022-Q4), 10Sustainability (Incident Followup): Most Icinga http checks ignore the URL parameter - https://phabricator.wikimedia.org/T304321 (10fgiunchedi) [12:57:28] PROBLEM - Disk space on ml-staging-ctrl2002 is CRITICAL: DISK CRITICAL - free space: / 1096 MB (5% inode=95%): /tmp 1096 MB (5% inode=95%): /var/tmp 1096 MB (5% inode=95%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ml-staging-ctrl2002&var-datasource=codfw+prometheus/ops [12:57:34] (03PS2) 10Arturo Borrero Gonzalez: hieradata: use ntp servers private ip addresses [puppet] - 10https://gerrit.wikimedia.org/r/777755 (owner: 10Majavah) [12:57:36] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] hieradata: use puppet-enc hostname in codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/778574 (https://phabricator.wikimedia.org/T295247) (owner: 10Majavah) [12:57:43] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for drochford (superset access with no server access) - https://phabricator.wikimedia.org/T305634 (10jcrespo) [12:58:02] !log aqu@deploy1002 Finished deploy [analytics/refinery@f0a1656]: Migrate mediarequest hourly from Oozie to Airflow [analytics/refinery@f0a1656] (duration: 20m 23s) [12:58:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:29] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] hieradata: use ntp servers private ip addresses [puppet] - 10https://gerrit.wikimedia.org/r/777755 (owner: 10Majavah) [12:59:10] PROBLEM - Disk space on 
ml-staging-ctrl2001 is CRITICAL: DISK CRITICAL - free space: / 1116 MB (5% inode=95%): /tmp 1116 MB (5% inode=95%): /var/tmp 1116 MB (5% inode=95%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ml-staging-ctrl2001&var-datasource=codfw+prometheus/ops [13:01:06] 10SRE, 10Goal, 10MW-1.38-notes (1.38.0-wmf.4; 2021-10-12), 10Patch-For-Review, and 2 others: Fully migrate producers off statsd - https://phabricator.wikimedia.org/T205870 (10lmata) [13:03:21] !log aqu@deploy1002 Started deploy [analytics/refinery@f0a1656] (thin): Migrate mediarequest hourly from Oozie to Airflow [analytics/refinery@f0a1656] [13:03:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:29] !log aqu@deploy1002 Finished deploy [analytics/refinery@f0a1656] (thin): Migrate mediarequest hourly from Oozie to Airflow [analytics/refinery@f0a1656] (duration: 00m 07s) [13:03:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:09] !log aqu@deploy1002 Started deploy [analytics/refinery@f0a1656] (hadoop-test): Migrate mediarequest hourly from Oozie to Airflow [analytics/refinery@f0a1656] [13:04:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:52] RECOVERY - SSH on mw2258.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:05:05] (03PS1) 10Jcrespo: admin: Add drochford to analytics-privatedata-users for superset [puppet] - 10https://gerrit.wikimedia.org/r/779024 (https://phabricator.wikimedia.org/T305634) [13:05:30] (03CR) 10Jcrespo: [C: 04-1] "Blocked on data engineering's ok." 
[puppet] - 10https://gerrit.wikimedia.org/r/779024 (https://phabricator.wikimedia.org/T305634) (owner: 10Jcrespo) [13:06:06] (03PS1) 10Ottomata: eventlogging - ReadingDepth schema has been deleted, don't attempt to ingest it [puppet] - 10https://gerrit.wikimedia.org/r/779025 [13:07:20] (03CR) 10Ottomata: "'Deleting' the schema caused errors as the Hive ingestion step tried to look up the latest schema for new ReadingDepth events that were st" [puppet] - 10https://gerrit.wikimedia.org/r/779025 (owner: 10Ottomata) [13:09:01] (03CR) 10Ottomata: [C: 03+2] eventlogging - ReadingDepth schema has been deleted, don't attempt to ingest it [puppet] - 10https://gerrit.wikimedia.org/r/779025 (owner: 10Ottomata) [13:09:04] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for drochford (superset access with no server access) - https://phabricator.wikimedia.org/T305634 (10jcrespo) a:05jcrespo→03Ottomata This is only blocked on Data Engineering, as owners of the service, to ap... [13:09:13] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for drochford (superset access with no server access) - https://phabricator.wikimedia.org/T305634 (10jcrespo) p:05Triage→03High [13:09:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', diff saved to https://phabricator.wikimedia.org/P24419 and previous config saved to /var/cache/conftool/dbconfig/20220411-130916-ladsgroup.json [13:09:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:12] (03PS1) 10Ladsgroup: admin: Fix real name [puppet] - 10https://gerrit.wikimedia.org/r/779026 [13:10:38] 10SRE, 10SRE-Access-Requests: Requesting access to google console for TomekSikora.Monsoon - https://phabricator.wikimedia.org/T304502 (10jcrespo) a:05TomekSikora.Monsoon→03None [13:10:57] (03CR) 10Ssingh: [C: 03+1] "Per discussion." 
[puppet] - 10https://gerrit.wikimedia.org/r/779026 (owner: 10Ladsgroup) [13:11:10] !log aqu@deploy1002 Finished deploy [analytics/refinery@f0a1656] (hadoop-test): Migrate mediarequest hourly from Oozie to Airflow [analytics/refinery@f0a1656] (duration: 07m 00s) [13:11:12] (03PS2) 10Ladsgroup: admin: Fix real name [puppet] - 10https://gerrit.wikimedia.org/r/779026 [13:11:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:16] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] admin: Fix real name [puppet] - 10https://gerrit.wikimedia.org/r/779026 (owner: 10Ladsgroup) [13:15:01] (BlazegraphJvmQuakeWarnGC) firing: Blazegraph instance wdqs1013:9100 is entering a GC death spiral - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphJvmQuakeWarnGC [13:15:08] 10SRE, 10Traffic, 10SRE Observability (FY2021/2022-Q4), 10User-fgiunchedi: Migrate Traffic Prometheus alerts from Icinga to Alertmanager - https://phabricator.wikimedia.org/T300723 (10lmata) [13:18:29] (03CR) 10Btullis: [C: 03+2] Configure LDAP authentication for DataHub [deployment-charts] - 10https://gerrit.wikimedia.org/r/778345 (https://phabricator.wikimedia.org/T301462) (owner: 10Btullis) [13:20:01] (BlazegraphJvmQuakeWarnGC) resolved: Blazegraph instance wdqs1013:9100 is entering a GC death spiral - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphJvmQuakeWarnGC [13:20:12] (03CR) 10MSantos: [C: 03+1] maps: Re-enable OSM sync for on eqiad master [puppet] - 10https://gerrit.wikimedia.org/r/772453 (https://phabricator.wikimedia.org/T304984) (owner: 10Jgiannelos) [13:20:17] 10SRE, 10SRE-Access-Requests: Requesting access to Analytic Cluster for Research Intern (paramita_das) - https://phabricator.wikimedia.org/T305298 
(10Ottomata) Hello! Yes: https://wikitech.wikimedia.org/wiki/SRE/Production_access#Setting_up_your_access [13:21:47] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for drochford (superset access with no server access) - https://phabricator.wikimedia.org/T305634 (10drochford) >>! In T305634#7844679, @jcrespo wrote: > Hey, @drochford, > > While I check and process your acc... [13:22:03] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for drochford (superset access with no server access) - https://phabricator.wikimedia.org/T305634 (10Ottomata) Approved [13:22:33] (03CR) 10JMeybohm: Configure LDAP authentication for DataHub (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/778345 (https://phabricator.wikimedia.org/T301462) (owner: 10Btullis) [13:22:54] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10MatthewVernon) a:03MatthewVernon I'm not sure I'm going to do anything about xfs options, but I am going to start reimaging hosts to Bullseye, and going to use this task to trac... 
[13:22:56] (03Merged) 10jenkins-bot: Configure LDAP authentication for DataHub [deployment-charts] - 10https://gerrit.wikimedia.org/r/778345 (https://phabricator.wikimedia.org/T301462) (owner: 10Btullis) [13:23:17] 10SRE, 10Prod-Kubernetes, 10Traffic, 10serviceops, 10Kubernetes: service::catalog entries and dnsdisc for Kubernetes services under Ingress - https://phabricator.wikimedia.org/T305358 (10JMeybohm) [13:24:17] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10MatthewVernon) [13:24:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311 (T298565)', diff saved to https://phabricator.wikimedia.org/P24420 and previous config saved to /var/cache/conftool/dbconfig/20220411-132422-ladsgroup.json [13:24:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:26] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [13:24:28] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2103.codfw.wmnet with reason: Maintenance [13:24:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:29] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2103.codfw.wmnet with reason: Maintenance [13:24:31] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 14 hosts with reason: Maintenance [13:24:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:40] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 14 hosts 
with reason: Maintenance [13:24:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:19] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for drochford (superset access with no server access) - https://phabricator.wikimedia.org/T305634 (10jcrespo) a:05Ottomata→03jcrespo Thank you a lot, drochford. Anything that helps us process request faster... [13:25:47] PROBLEM - MariaDB Replica Lag: x2 #page on db1153 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 3256.88 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:25:54] * volans here [13:25:59] (03CR) 10Jcrespo: admin: Add drochford to analytics-privatedata-users for superset [puppet] - 10https://gerrit.wikimedia.org/r/779024 (https://phabricator.wikimedia.org/T305634) (owner: 10Jcrespo) [13:26:05] isn't x2 not yet in production? [13:26:06] o/ [13:26:08] x2 is not used [13:26:10] it is not [13:26:11] kormat: ^^^ [13:26:12] don't worry [13:26:23] I resolve it [13:26:23] probably shouldn't be p.age enabled then :) [13:26:30] was about to say the same [13:26:33] #FALSE_ALARM [13:26:43] for once I am quicker than volans! \o/ [13:26:54] ;) [13:26:56] lol [13:26:56] that's an accomplishment and a half right there :) [13:26:58] the reason why it was enabled is because we were told months ago that it would go to production [13:27:12] soonTM [13:29:13] `Slave_IO_State: Waiting to reconnect after a failed master event read` [13:29:23] there's something unhappy between db1153 and db1151, which was rebooted earlier. [13:30:08] `show slave hosts` on db1151 shows db1153 ~28 times [13:30:24] i've no idea what's going on there [13:30:29] and i'm out sick [13:30:33] marostegui: can i leave this with you? 
[13:30:39] yes [13:30:42] ty <3 [13:32:37] RECOVERY - MariaDB Replica Lag: x2 #page on db1153 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:33:16] I have fixed it, but db2143 has been down for 8h too, why is that? kormat? [13:33:46] Ah, it is the one for the onsite maintenance [13:33:49] Anyways [13:36:52] 10SRE, 10SRE-Access-Requests, 10SRE-OnFire, 10WMF-Legal: Grant Zabe access to the T302047 gdoc incident report - https://phabricator.wikimedia.org/T302163 (10jcrespo) @KFrancis can you help us confirm this (SREs don't have access to the legal ticket system). [13:38:09] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10MatthewVernon) [13:39:25] (03PS1) 10Btullis: Add the codfw LDAP server to the DataHub JAAS configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/779031 (https://phabricator.wikimedia.org/T301454) [13:40:32] (03CR) 10Btullis: Add the codfw LDAP server to the DataHub JAAS configuration (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/779031 (https://phabricator.wikimedia.org/T301454) (owner: 10Btullis) [13:42:39] (03CR) 10Andrew Bogott: [V: 03+2 C: 03+2] Update git repo to correspond to the actual running files [wikitech-static] - 10https://gerrit.wikimedia.org/r/775396 (owner: 10Andrew Bogott) [13:42:51] (03CR) 10Andrew Bogott: [V: 03+2 C: 03+2] import-wikitech.sh: nukeNS.php --ns 8 before import [wikitech-static] - 10https://gerrit.wikimedia.org/r/775397 (owner: 10Andrew Bogott) [13:44:06] (03CR) 10JMeybohm: [C: 03+1] Add the codfw LDAP server to the DataHub JAAS configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/779031 (https://phabricator.wikimedia.org/T301454) (owner: 10Btullis) [13:44:17] (03CR) 10Btullis: [C: 03+2] Add the codfw LDAP server to the DataHub JAAS configuration [deployment-charts] - 
10https://gerrit.wikimedia.org/r/779031 (https://phabricator.wikimedia.org/T301454) (owner: 10Btullis) [13:44:32] (03PS1) 10Zabe: acme_chief: migrate acme-chief-designate-tidyup cron to systemd timer job [puppet] - 10https://gerrit.wikimedia.org/r/779032 (https://phabricator.wikimedia.org/T273673) [13:44:34] (03PS1) 10Zabe: acme_chief: remove absented acme-chief-designate-tidyup cron [puppet] - 10https://gerrit.wikimedia.org/r/779033 (https://phabricator.wikimedia.org/T273673) [13:45:13] (03CR) 10jerkins-bot: [V: 04-1] acme_chief: migrate acme-chief-designate-tidyup cron to systemd timer job [puppet] - 10https://gerrit.wikimedia.org/r/779032 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [13:45:35] (03CR) 10jerkins-bot: [V: 04-1] acme_chief: remove absented acme-chief-designate-tidyup cron [puppet] - 10https://gerrit.wikimedia.org/r/779033 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [13:46:43] (03CR) 10Hnowlan: [C: 03+2] maps: Re-enable OSM sync for on eqiad master [puppet] - 10https://gerrit.wikimedia.org/r/772453 (https://phabricator.wikimedia.org/T304984) (owner: 10Jgiannelos) [13:46:46] (03PS2) 10Zabe: acme_chief: migrate acme-chief-designate-tidyup cron to systemd timer job [puppet] - 10https://gerrit.wikimedia.org/r/779032 (https://phabricator.wikimedia.org/T273673) [13:48:16] (03Merged) 10jenkins-bot: Add the codfw LDAP server to the DataHub JAAS configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/779031 (https://phabricator.wikimedia.org/T301454) (owner: 10Btullis) [13:48:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T297189)', diff saved to https://phabricator.wikimedia.org/P24421 and previous config saved to /var/cache/conftool/dbconfig/20220411-134848-marostegui.json [13:48:49] (03CR) 10MVernon: [C: 03+1] swift: Create a new read-only role on mw account for backup taking (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/773298 
(https://phabricator.wikimedia.org/T269108) (owner: 10Jcrespo) [13:48:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:53] T297189: Schema change for dropping ft_title and ft_namespace - https://phabricator.wikimedia.org/T297189 [13:51:50] PROBLEM - SSH on aqs1009.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:52:10] (03PS1) 10Marostegui: x2: Disable notifications for x2 DBs [puppet] - 10https://gerrit.wikimedia.org/r/779034 [13:53:05] !log mvernon@cumin1001 START - Cookbook sre.hosts.reimage for host ms-fe1012.eqiad.wmnet with OS bullseye [13:53:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:08] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1001 for host ms-fe1012.eqiad.wmnet with OS bullseye [13:53:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1119', diff saved to https://phabricator.wikimedia.org/P24422 and previous config saved to /var/cache/conftool/dbconfig/20220411-135343-root.json [13:53:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:01] (BlazegraphJvmQuakeWarnGC) firing: Blazegraph instance wdqs1007:9100 is entering a GC death spiral - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphJvmQuakeWarnGC [13:54:03] (03CR) 10Marostegui: [C: 03+2] x2: Disable notifications for x2 DBs [puppet] - 10https://gerrit.wikimedia.org/r/779034 (owner: 10Marostegui) [13:55:34] (03Abandoned) 10Andrew Bogott: openstack:haproxy add tls for nova metadata service [puppet] - 10https://gerrit.wikimedia.org/r/732398 
(https://phabricator.wikimedia.org/T267194) (owner: 10Andrew Bogott) [13:56:51] PROBLEM - MariaDB read only x2 #page on db2142 is CRITICAL: CRIT: read_only: True, expected False: OK: Version 10.4.22-MariaDB-log, Uptime 5241s, event_scheduler: True, 16.60 QPS, connection latency: 0.004217s, query latency: 0.000511s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [13:57:06] #FALSE_ALARM ... again [13:57:34] * volans acked on VO [13:57:39] m.arostegui is disabling paging for it [13:57:39] And I just pushed the code to disable notifications [13:57:53] Anyways, I fixed that too [13:59:09] RECOVERY - MariaDB read only x2 #page on db2142 is OK: Version 10.4.22-MariaDB-log, Uptime 5379s, read_only: False, event_scheduler: True, 16.55 QPS, connection latency: 0.004429s, query latency: 0.000496s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [13:59:21] thanks for the fix [13:59:46] Also fixed db1151 which would have paged too [14:03:00] (03PS1) 10Btullis: Use the LDAP read-only replicas for datahub authentication [deployment-charts] - 10https://gerrit.wikimedia.org/r/779039 (https://phabricator.wikimedia.org/T301462) [14:03:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P24423 and previous config saved to /var/cache/conftool/dbconfig/20220411-140353-marostegui.json [14:03:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1119', diff saved to https://phabricator.wikimedia.org/P24424 and previous config saved to /var/cache/conftool/dbconfig/20220411-140415-root.json [14:04:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:25] (03CR) 10Majavah: [C: 03+1] Use the LDAP read-only replicas for datahub authentication [deployment-charts] - 
10https://gerrit.wikimedia.org/r/779039 (https://phabricator.wikimedia.org/T301462) (owner: 10Btullis) [14:07:42] !log mvernon@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-fe1012.eqiad.wmnet with reason: host reimage [14:07:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:41] (03PS1) 10Zabe: ci: migrate gitcache crons to systemd timer jobs [puppet] - 10https://gerrit.wikimedia.org/r/779040 (https://phabricator.wikimedia.org/T273673) [14:08:43] (03PS1) 10Zabe: ci: remove absented gitcache crons [puppet] - 10https://gerrit.wikimedia.org/r/779041 (https://phabricator.wikimedia.org/T273673) [14:09:14] (03CR) 10jerkins-bot: [V: 04-1] ci: migrate gitcache crons to systemd timer jobs [puppet] - 10https://gerrit.wikimedia.org/r/779040 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [14:09:33] (03CR) 10jerkins-bot: [V: 04-1] ci: remove absented gitcache crons [puppet] - 10https://gerrit.wikimedia.org/r/779041 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [14:09:42] (03CR) 10Btullis: [C: 03+2] Use the LDAP read-only replicas for datahub authentication [deployment-charts] - 10https://gerrit.wikimedia.org/r/779039 (https://phabricator.wikimedia.org/T301462) (owner: 10Btullis) [14:09:56] (03PS2) 10Zabe: acme_chief: remove absented acme-chief-designate-tidyup cron [puppet] - 10https://gerrit.wikimedia.org/r/779033 (https://phabricator.wikimedia.org/T273673) [14:10:35] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-fe1012.eqiad.wmnet with reason: host reimage [14:10:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:24] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1119.eqiad.wmnet with reason: Maintenance [14:11:25] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1119.eqiad.wmnet with reason: Maintenance [14:11:25] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:28] (03PS2) 10Zabe: ci: migrate gitcache crons to systemd timer jobs [puppet] - 10https://gerrit.wikimedia.org/r/779040 (https://phabricator.wikimedia.org/T273673) [14:13:59] (03PS2) 10Zabe: ci: remove absented gitcache crons [puppet] - 10https://gerrit.wikimedia.org/r/779041 (https://phabricator.wikimedia.org/T273673) [14:14:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1119', diff saved to https://phabricator.wikimedia.org/P24425 and previous config saved to /var/cache/conftool/dbconfig/20220411-141428-root.json [14:14:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:36] (03Merged) 10jenkins-bot: Use the LDAP read-only replicas for datahub authentication [deployment-charts] - 10https://gerrit.wikimedia.org/r/779039 (https://phabricator.wikimedia.org/T301462) (owner: 10Btullis) [14:14:47] PROBLEM - Host an-worker1099 is DOWN: PING CRITICAL - Packet loss = 100% [14:15:09] RECOVERY - Host an-worker1099 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [14:17:09] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [14:17:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:13] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [14:18:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P24426 and previous config saved to /var/cache/conftool/dbconfig/20220411-141858-marostegui.json [14:19:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:05] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic: cp5002 memory errors on DIMM A4 - 
https://phabricator.wikimedia.org/T305423 (10RobH) p:05Triage→03Medium [14:21:27] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic: ganeti4002 dimm error - https://phabricator.wikimedia.org/T303318 (10RobH) I'll chase this down today, I got the notice of processing but no shipment so I'll need to email Dell and find out what happened with this. [14:21:40] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic: ganeti4002 dimm error - https://phabricator.wikimedia.org/T303318 (10RobH) p:05Medium→03High [14:22:15] !log powerdown ganeti2019 for relocation [14:22:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:39] this might be a stupid question, but how does one actually schedule a dedicated deployment window, if you think you need one? (in this case, for a maintenance script that might need more than an hour) [14:22:51] (03Abandoned) 10Majavah: wmcs: toolforge: add_grid_webgrid_generic_node: fix description [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/749711 (owner: 10Majavah) [14:22:53] can I just add it to the deployment calendar myself? 
(once the calendar for this week materializes, that is ^^) [14:23:00] that part isn’t really clear to me from https://wikitech.wikimedia.org/wiki/Deployments/Inclusion_criteria [14:23:14] Lucas_WMDE: yes, just add it to the calendar [14:23:20] ok thanks :) [14:24:49] PROBLEM - Host ganeti2019 is DOWN: PING CRITICAL - Packet loss = 100% [14:26:29] PROBLEM - Host ganeti2019.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:27:04] (03PS1) 10Btullis: Remove override for datahub-frontend staging egress [deployment-charts] - 10https://gerrit.wikimedia.org/r/779045 (https://phabricator.wikimedia.org/T301462) [14:29:25] RECOVERY - Check systemd state on an-worker1101 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:31:55] RECOVERY - Host ganeti2019.mgmt is UP: PING OK - Packet loss = 0%, RTA = 38.78 ms [14:33:14] (03CR) 10Btullis: [C: 03+2] Remove override for datahub-frontend staging egress [deployment-charts] - 10https://gerrit.wikimedia.org/r/779045 (https://phabricator.wikimedia.org/T301462) (owner: 10Btullis) [14:33:23] (03PS1) 10JMeybohm: Add all members of the ops group to the deployment group [puppet] - 10https://gerrit.wikimedia.org/r/779047 (https://phabricator.wikimedia.org/T305729) [14:33:55] (03CR) 10Andrew Bogott: [V: 03+2 C: 03+2] Update ldap role names [labs/private] - 10https://gerrit.wikimedia.org/r/776188 (https://phabricator.wikimedia.org/T295150) (owner: 10Majavah) [14:34:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T297189)', diff saved to https://phabricator.wikimedia.org/P24427 and previous config saved to /var/cache/conftool/dbconfig/20220411-143403-marostegui.json [14:34:05] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1101.eqiad.wmnet with reason: Maintenance [14:34:06] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1101.eqiad.wmnet with 
reason: Maintenance [14:34:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:08] T297189: Schema change for dropping ft_title and ft_namespace - https://phabricator.wikimedia.org/T297189 [14:34:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1101:3317 (T297189)', diff saved to https://phabricator.wikimedia.org/P24428 and previous config saved to /var/cache/conftool/dbconfig/20220411-143411-marostegui.json [14:34:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:44] (03PS3) 10Majavah: Rename O:ldap::labs to O:ldap::rw [puppet] - 10https://gerrit.wikimedia.org/r/776187 (https://phabricator.wikimedia.org/T295150) [14:34:56] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-fe1012.eqiad.wmnet with OS bullseye [14:34:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:00] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1001 for host ms-fe1012.eqiad.wmnet with OS bullseye completed: - ms-fe1012 (**WARN**) - Downtim... 
[14:35:39] 10SRE, 10ops-codfw, 10DBA: codfw: Dedicate Rack B1 for cloudX-dev servers - https://phabricator.wikimedia.org/T305469 (10Papaul) [14:36:11] RECOVERY - Host ganeti2019 is UP: PING OK - Packet loss = 0%, RTA = 171.67 ms [14:36:22] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (NOOP 2 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34770/console" [puppet] - 10https://gerrit.wikimedia.org/r/776187 (https://phabricator.wikimedia.org/T295150) (owner: 10Majavah) [14:37:37] PROBLEM - Host db2076.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:37:37] (03CR) 10Majavah: [V: 03+1] "I guess the main thing to be careful with this is to rename any hiera files in the real private git repo." [puppet] - 10https://gerrit.wikimedia.org/r/776187 (https://phabricator.wikimedia.org/T295150) (owner: 10Majavah) [14:38:03] (03Merged) 10jenkins-bot: Remove override for datahub-frontend staging egress [deployment-charts] - 10https://gerrit.wikimedia.org/r/779045 (https://phabricator.wikimedia.org/T301462) (owner: 10Btullis) [14:41:37] (03CR) 10Andrew Bogott: [C: 03+2] Rename O:ldap::labs to O:ldap::rw (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/776187 (https://phabricator.wikimedia.org/T295150) (owner: 10Majavah) [14:43:41] RECOVERY - Host db2076.mgmt is UP: PING OK - Packet loss = 0%, RTA = 44.96 ms [14:47:26] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [14:47:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:40] (03PS1) 10JMeybohm: Switch default group for Kubernetes credentials files to deployer [puppet] - 10https://gerrit.wikimedia.org/r/779048 (https://phabricator.wikimedia.org/T305729) [14:48:16] (03CR) 10Andrew Bogott: [C: 03+2] striker: Use ldap-rw hostname for ldap [puppet] - 10https://gerrit.wikimedia.org/r/776189 (https://phabricator.wikimedia.org/T295150) (owner: 10Majavah) [14:48:40] (03CR) 10Vgutierrez: [C: 03+1] "looking good" 
[puppet] - 10https://gerrit.wikimedia.org/r/777899 (https://phabricator.wikimedia.org/T305581) (owner: 10RLazarus) [14:49:19] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34773/console" [puppet] - 10https://gerrit.wikimedia.org/r/779048 (https://phabricator.wikimedia.org/T305729) (owner: 10JMeybohm) [14:49:28] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [14:49:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:01] (03PS1) 10Majavah: hieradata: switch eqiad1 to use the new enc server [puppet] - 10https://gerrit.wikimedia.org/r/779049 (https://phabricator.wikimedia.org/T295247) [14:50:09] (03PS2) 10Andrew Bogott: dynamicproxy: remove support for x-novaproxy-edit-dns [puppet] - 10https://gerrit.wikimedia.org/r/777316 (https://phabricator.wikimedia.org/T295246) (owner: 10Majavah) [14:50:15] PROBLEM - Host db2086.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:50:16] 10SRE, 10ops-codfw, 10DBA: codfw: Dedicate Rack B1 for cloudX-dev servers - https://phabricator.wikimedia.org/T305469 (10Papaul) [14:50:39] 10SRE, 10ops-codfw, 10DBA: codfw: Dedicate Rack B1 for cloudX-dev servers - https://phabricator.wikimedia.org/T305469 (10Papaul) [14:50:55] (03PS1) 10MVernon: swift: handle new installs where there are no rings [puppet] - 10https://gerrit.wikimedia.org/r/779050 [14:52:07] (03CR) 10Andrew Bogott: [C: 03+2] dynamicproxy: remove support for x-novaproxy-edit-dns [puppet] - 10https://gerrit.wikimedia.org/r/777316 (https://phabricator.wikimedia.org/T295246) (owner: 10Majavah) [14:52:19] (03PS1) 10Majavah: P:toolforge: use puppetdb for grid hba data [puppet] - 10https://gerrit.wikimedia.org/r/779051 (https://phabricator.wikimedia.org/T153163) [14:52:40] !log mvernon@cumin1001 conftool action : set/pooled=yes; selector: dc=eqiad,cluster=swift,service=nginx [14:52:41] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:51] !log mvernon@cumin1001 conftool action : set/pooled=yes; selector: dc=eqiad,cluster=swift,service=swift-fe [14:52:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:12] (03CR) 10Filippo Giunchedi: swift: handle new installs where there are no rings (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/779050 (owner: 10MVernon) [14:55:18] RECOVERY - Host db2086.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.60 ms [14:55:57] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34774/console" [puppet] - 10https://gerrit.wikimedia.org/r/779051 (https://phabricator.wikimedia.org/T153163) (owner: 10Majavah) [14:56:20] (03CR) 10MVernon: swift: handle new installs where there are no rings (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/779050 (owner: 10MVernon) [14:57:23] 10SRE, 10ops-codfw, 10DBA: codfw: Dedicate Rack B1 for cloudX-dev servers - https://phabricator.wikimedia.org/T305469 (10Papaul) [14:59:48] PROBLEM - Host db2107.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:01:11] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1135.eqiad.wmnet with reason: Maintenance [15:01:12] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1135.eqiad.wmnet with reason: Maintenance [15:01:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1135 (T298565)', diff saved to https://phabricator.wikimedia.org/P24429 and previous config saved to /var/cache/conftool/dbconfig/20220411-150117-ladsgroup.json [15:01:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:21] T298565: Fix mismatching field type of 
user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [15:05:18] RECOVERY - Host db2107.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.54 ms [15:05:55] 10SRE, 10ops-codfw, 10DBA: codfw: Dedicate Rack B1 for cloudX-dev servers - https://phabricator.wikimedia.org/T305469 (10Papaul) [15:07:19] (03PS14) 10Herron: prometheus: enable prometheus web access via proxy with IDP [puppet] - 10https://gerrit.wikimedia.org/r/764895 (https://phabricator.wikimedia.org/T301944) [15:07:42] PROBLEM - Host db2137.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:08:00] (03CR) 10jerkins-bot: [V: 04-1] prometheus: enable prometheus web access via proxy with IDP [puppet] - 10https://gerrit.wikimedia.org/r/764895 (https://phabricator.wikimedia.org/T301944) (owner: 10Herron) [15:11:52] (03PS15) 10Herron: prometheus: enable prometheus web access via proxy with IDP [puppet] - 10https://gerrit.wikimedia.org/r/764895 (https://phabricator.wikimedia.org/T301944) [15:14:22] RECOVERY - Host db2137.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.15 ms [15:17:32] 10SRE, 10ops-codfw, 10DBA: codfw: Dedicate Rack B1 for cloudX-dev servers - https://phabricator.wikimedia.org/T305469 (10Papaul) [15:21:12] (03CR) 10Herron: prometheus: enable prometheus web access via proxy with IDP (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/764895 (https://phabricator.wikimedia.org/T301944) (owner: 10Herron) [15:24:54] (03CR) 10Ahmon Dancy: Add all members of the ops group to the deployment group (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/779047 (https://phabricator.wikimedia.org/T305729) (owner: 10JMeybohm) [15:26:10] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/779060 (https://phabricator.wikimedia.org/T128546) [15:26:17] (03CR) 
10Ahmon Dancy: [C: 04-1] Switch default group for Kubernetes credentials files to deployer (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/779048 (https://phabricator.wikimedia.org/T305729) (owner: 10JMeybohm) [15:26:34] (03CR) 10Jcrespo: "Waiting for a review from someone else for merging." [puppet] - 10https://gerrit.wikimedia.org/r/779024 (https://phabricator.wikimedia.org/T305634) (owner: 10Jcrespo) [15:27:12] (03CR) 10Ahmon Dancy: "There's a commit message typo but I'm in favor of the change." [puppet] - 10https://gerrit.wikimedia.org/r/779047 (https://phabricator.wikimedia.org/T305729) (owner: 10JMeybohm) [15:27:32] PROBLEM - Host db2147.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:28:00] (03CR) 10Majavah: [C: 04-1] "'deployment' needs to be added to the special ops groups list in modules/openldap/files/cross-validate-accounts.py" [puppet] - 10https://gerrit.wikimedia.org/r/779047 (https://phabricator.wikimedia.org/T305729) (owner: 10JMeybohm) [15:30:05] jan_drewniak: #bothumor I � Unicode. All rise for Wikimedia Portals Update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220411T1530). 
[15:30:25] 10SRE, 10ops-codfw, 10DBA: codfw: Dedicate Rack B1 for cloudX-dev servers - https://phabricator.wikimedia.org/T305469 (10Papaul) [15:30:34] (03CR) 10Jdrewniak: [C: 03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/779060 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [15:31:12] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/779060 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [15:31:31] (03PS4) 10Lucas Werkmeister (WMDE): Use wgRestAPIAdditionalRouteFiles for WB REST API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/774901 (owner: 10Jakob) [15:33:08] !log jdrewniak@deploy1002 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:779060| Bumping portals to master (T128546)]] (duration: 00m 56s) [15:33:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:13] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [15:33:50] RECOVERY - Host db2147.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.76 ms [15:34:02] !log jdrewniak@deploy1002 Synchronized portals: Wikimedia Portals Update: [[gerrit:779060| Bumping portals to master (T128546)]] (duration: 00m 53s) [15:34:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:17] 10SRE, 10SRE-Access-Requests: Requesting access to google console for TomekSikora.Monsoon - https://phabricator.wikimedia.org/T304502 (10soworu) Hi @SCherukuwada. Charlene confirmed that there's an MSA on file. According to her feedback > "Monsoon signed out standard MSA for consulting work. It includes conf... 
[15:35:33] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [15:35:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:36] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [15:35:37] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [15:35:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:41] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [15:35:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:30] 10SRE, 10ops-codfw, 10DBA: codfw: Dedicate Rack B1 for cloudX-dev servers - https://phabricator.wikimedia.org/T305469 (10Marostegui) mysql started on db* hosts [15:43:54] (03PS2) 10MVernon: swift: handle new installs where there are no rings [puppet] - 10https://gerrit.wikimedia.org/r/779050 [15:44:12] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] "Confirmed: https://codesearch.wmcloud.org/search/?q=KartographerUsePageLanguage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/779014 (owner: 10Awight) [15:44:35] (03CR) 10MVernon: swift: handle new installs where there are no rings (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/779050 (owner: 10MVernon) [15:46:14] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:46:38] PROBLEM - Host es2029.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:47:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T297189)', diff saved to https://phabricator.wikimedia.org/P24430 and previous config saved to /var/cache/conftool/dbconfig/20220411-154725-marostegui.json [15:47:28] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:30] T297189: Schema change for dropping ft_title and ft_namespace - https://phabricator.wikimedia.org/T297189 [15:49:32] (03CR) 10Filippo Giunchedi: "LGTM overall, see inline" [puppet] - 10https://gerrit.wikimedia.org/r/764895 (https://phabricator.wikimedia.org/T301944) (owner: 10Herron) [15:50:05] (03PS1) 10CDanis: upload VCL: Only apply requestctl rules to external clients [puppet] - 10https://gerrit.wikimedia.org/r/779064 [15:51:25] (03CR) 10Filippo Giunchedi: [C: 03+1] "Ship it!" [puppet] - 10https://gerrit.wikimedia.org/r/779050 (owner: 10MVernon) [15:53:00] RECOVERY - Host es2029.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.75 ms [15:53:43] (03CR) 10MVernon: [C: 03+2] swift: handle new installs where there are no rings [puppet] - 10https://gerrit.wikimedia.org/r/779050 (owner: 10MVernon) [15:54:24] (03PS5) 10Cathal Mooney: Add template to configure IPv6 RAs on CRs and L3 Switches [homer/public] - 10https://gerrit.wikimedia.org/r/773587 (https://phabricator.wikimedia.org/T299758) [15:56:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T298565)', diff saved to https://phabricator.wikimedia.org/P24431 and previous config saved to /var/cache/conftool/dbconfig/20220411-155620-ladsgroup.json [15:56:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:24] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [15:58:08] (03CR) 10Cathal Mooney: [C: 03+2] Add template to configure IPv6 RAs on CRs and L3 Switches (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/773587 (https://phabricator.wikimedia.org/T299758) (owner: 10Cathal Mooney) [15:58:43] (03Merged) 10jenkins-bot: Add 
template to configure IPv6 RAs on CRs and L3 Switches [homer/public] - 10https://gerrit.wikimedia.org/r/773587 (https://phabricator.wikimedia.org/T299758) (owner: 10Cathal Mooney) [16:00:10] 10SRE, 10ops-codfw, 10DBA: codfw: Dedicate Rack B1 for cloudX-dev servers - https://phabricator.wikimedia.org/T305469 (10Papaul) [16:00:54] (03CR) 10Vgutierrez: [C: 03+1] upload VCL: Only apply requestctl rules to external clients [puppet] - 10https://gerrit.wikimedia.org/r/779064 (owner: 10CDanis) [16:01:49] (03CR) 10CDanis: [C: 03+1] external_clouds_vendors: Support entity types besides "cloud" [puppet] - 10https://gerrit.wikimedia.org/r/777899 (https://phabricator.wikimedia.org/T305581) (owner: 10RLazarus) [16:02:00] jouncebot: nowandnext [16:02:00] No deployments scheduled for the next 0 hour(s) and 57 minute(s) [16:02:00] In 0 hour(s) and 57 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220411T1700) [16:02:15] ok, I’ll deploy a config change that *should* only affect beta [16:02:24] (but it’s in a non-labs file so I’ll still test and sync it) [16:02:27] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Use wgRestAPIAdditionalRouteFiles for WB REST API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/774901 (owner: 10Jakob) [16:02:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P24432 and previous config saved to /var/cache/conftool/dbconfig/20220411-160230-marostegui.json [16:02:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:34] (03PS2) 10Majavah: hieradata: switch eqiad1 to use the new enc server [puppet] - 10https://gerrit.wikimedia.org/r/779049 (https://phabricator.wikimedia.org/T295247) [16:03:07] (03Merged) 10jenkins-bot: Use wgRestAPIAdditionalRouteFiles for WB REST API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/774901 (owner: 10Jakob) [16:04:26] testing on 
mwdebug1001 [16:04:42] looks good, syncing [16:04:55] (03PS3) 10Majavah: hieradata: switch eqiad1 to use the new enc server [puppet] - 10https://gerrit.wikimedia.org/r/779049 (https://phabricator.wikimedia.org/T295247) [16:05:42] PROBLEM - Host es2030.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:05:46] (03CR) 10BBlack: "Right idea! But there's already such a clause (~60 lines up where's not so obvious) in the upload case. It's the equivalent in text-front" [puppet] - 10https://gerrit.wikimedia.org/r/779064 (owner: 10CDanis) [16:05:58] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [16:06:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:01] (03CR) 10BBlack: [C: 04-1] upload VCL: Only apply requestctl rules to external clients [puppet] - 10https://gerrit.wikimedia.org/r/779064 (owner: 10CDanis) [16:06:01] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [16:06:02] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [16:06:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:06] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [16:06:07] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/Wikibase.php: Config: [[gerrit:774901|Use wgRestAPIAdditionalRouteFiles for WB REST API]] (duration: 00m 51s) [16:06:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:11] ok, I’m done [16:09:39] 10SRE, 10ops-codfw, 10DBA: codfw: Dedicate Rack B1 for cloudX-dev servers - https://phabricator.wikimedia.org/T305469 (10Papaul) [16:10:02] 10SRE-Access-Requests: Denial of Service due to repeated hits from a particular IP - 
https://phabricator.wikimedia.org/T305863 (10ERayfield) [16:11:14] (03CR) 10Vgutierrez: [C: 03+1] upload VCL: Only apply requestctl rules to external clients (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/779064 (owner: 10CDanis) [16:11:19] (03CR) 10Vgutierrez: upload VCL: Only apply requestctl rules to external clients [puppet] - 10https://gerrit.wikimedia.org/r/779064 (owner: 10CDanis) [16:11:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P24434 and previous config saved to /var/cache/conftool/dbconfig/20220411-161125-ladsgroup.json [16:11:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:00] RECOVERY - Host es2030.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.72 ms [16:16:08] RECOVERY - Check systemd state on snapshot1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:17:12] 10SRE, 10ops-codfw, 10DBA: codfw: Dedicate Rack B1 for cloudX-dev servers - https://phabricator.wikimedia.org/T305469 (10hnowlan) [16:17:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P24435 and previous config saved to /var/cache/conftool/dbconfig/20220411-161735-marostegui.json [16:17:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:01] (03PS1) 10Vgutierrez: vcl: Fix X-Abuse-Network typo [puppet] - 10https://gerrit.wikimedia.org/r/779068 (https://phabricator.wikimedia.org/T302471) [16:20:39] !log powerdown maps2006 for relocation [16:20:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:12] (03CR) 10CDanis: [C: 03+1] vcl: Fix X-Abuse-Network typo [puppet] - 10https://gerrit.wikimedia.org/r/779068 (https://phabricator.wikimedia.org/T302471) (owner: 10Vgutierrez) [16:23:01] 10SRE, 10Traffic: Denial of Service 
due to repeated hits from a particular IP - https://phabricator.wikimedia.org/T305863 (10RLazarus) Routing to #traffic to see if this is a VCL rule we're hitting. @ERayfield Can you provide some example requests, with headers and source IP? I'm going to preemptively make t... [16:23:46] PROBLEM - Host maps2006 is DOWN: PING CRITICAL - Packet loss = 100% [16:24:42] PROBLEM - Host maps2006.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:25:30] PROBLEM - Check systemd state on deneb is CRITICAL: CRITICAL - degraded: The following units failed: docker-reporter-releng-images.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:26:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P24436 and previous config saved to /var/cache/conftool/dbconfig/20220411-162630-ladsgroup.json [16:26:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:00] (03PS1) 10Majavah: hieradata: switch to ldap-rw naming on ldap hosts [puppet] - 10https://gerrit.wikimedia.org/r/779071 (https://phabricator.wikimedia.org/T295150) [16:29:06] RhinosF1: maps2006 should be back up online [16:29:26] RECOVERY - Host maps2006 is UP: PING OK - Packet loss = 0%, RTA = 31.61 ms [16:29:43] papaul: relayed [16:29:53] thanks! 
[16:30:06] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34775/console" [puppet] - 10https://gerrit.wikimedia.org/r/776187 (https://phabricator.wikimedia.org/T295150) (owner: 10Majavah) [16:30:11] !log aqu@deploy1002 Started deploy [airflow-dags/analytics@cae0024]: T302876_migrate_mediarequest_to_airflow [airflow-dags/analytics@cae0024] [16:30:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:19] !log aqu@deploy1002 Finished deploy [airflow-dags/analytics@cae0024]: T302876_migrate_mediarequest_to_airflow [airflow-dags/analytics@cae0024] (duration: 00m 08s) [16:30:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:32] 10SRE, 10ops-codfw, 10DBA: codfw: Dedicate Rack B1 for cloudX-dev servers - https://phabricator.wikimedia.org/T305469 (10Papaul) [16:31:02] RECOVERY - Host maps2006.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.72 ms [16:31:38] (03CR) 10Vgutierrez: [C: 03+2] vcl: Fix X-Abuse-Network typo [puppet] - 10https://gerrit.wikimedia.org/r/779068 (https://phabricator.wikimedia.org/T302471) (owner: 10Vgutierrez) [16:31:44] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34776/console" [puppet] - 10https://gerrit.wikimedia.org/r/779071 (https://phabricator.wikimedia.org/T295150) (owner: 10Majavah) [16:32:14] 10SRE, 10ops-codfw, 10DBA: codfw: Dedicate Rack B1 for cloudX-dev servers - https://phabricator.wikimedia.org/T305469 (10Papaul) [16:32:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T297189)', diff saved to https://phabricator.wikimedia.org/P24437 and previous config saved to /var/cache/conftool/dbconfig/20220411-163240-marostegui.json [16:32:42] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1170.eqiad.wmnet with reason: Maintenance [16:32:44] !log 
marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1170.eqiad.wmnet with reason: Maintenance [16:32:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:45] T297189: Schema change for dropping ft_title and ft_namespace - https://phabricator.wikimedia.org/T297189 [16:32:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3317 (T297189)', diff saved to https://phabricator.wikimedia.org/P24438 and previous config saved to /var/cache/conftool/dbconfig/20220411-163248-marostegui.json [16:32:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:29] 10SRE, 10ops-codfw, 10DBA: codfw: Dedicate Rack B1 for cloudX-dev servers - https://phabricator.wikimedia.org/T305469 (10Marostegui) mysql started on es* hosts [16:35:11] 10SRE, 10ops-codfw, 10Data-Persistence (Consultation): codfw: Dedicate Rack B1 for cloudX-dev servers - https://phabricator.wikimedia.org/T305469 (10Marostegui) Changing the tag as our DBA part here is done. If there's anything else required, I am still subscribed to the task. 
[16:36:14] 10SRE, 10ops-codfw, 10Data-Persistence (Consultation): codfw: Dedicate Rack B1 for cloudX-dev servers - https://phabricator.wikimedia.org/T305469 (10Papaul) @Marostegui thanks [16:41:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T298565)', diff saved to https://phabricator.wikimedia.org/P24439 and previous config saved to /var/cache/conftool/dbconfig/20220411-164136-ladsgroup.json [16:41:38] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1134.eqiad.wmnet with reason: Maintenance [16:41:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:40] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1134.eqiad.wmnet with reason: Maintenance [16:41:41] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [16:41:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1134 (T298565)', diff saved to https://phabricator.wikimedia.org/P24440 and previous config saved to /var/cache/conftool/dbconfig/20220411-164144-ladsgroup.json [16:41:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:16] (03PS1) 10BBlack: Exclude WMF cloud IPs from generic cloud limiter [puppet] - 10https://gerrit.wikimedia.org/r/779074 [16:46:23] (03CR) 10Vgutierrez: [C: 03+1] Exclude WMF cloud IPs from generic cloud limiter [puppet] - 10https://gerrit.wikimedia.org/r/779074 (owner: 10BBlack) [16:47:28] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:54:38] RECOVERY - SSH on aqs1009.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:55:09] (03PS1) 10Btullis: Add a volume for the jaas-ldap configuration for datahub [deployment-charts] - 10https://gerrit.wikimedia.org/r/779077 (https://phabricator.wikimedia.org/T301454) [16:55:27] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on wcqs2001.codfw.wmnet with reason: physically moving host [16:55:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:29] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on wcqs2001.codfw.wmnet with reason: physically moving host [16:55:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:34] 10SRE, 10ops-codfw, 10Data-Persistence (Consultation): codfw: Dedicate Rack B1 for cloudX-dev servers - https://phabricator.wikimedia.org/T305469 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=dc2c981d-aef2-4a2b-9d24-2e3ca912b985) set by bking@cumin1001 for 1 day, 0:00:00 on 1 host(s) an... [16:59:01] (BlazegraphJvmQuakeWarnGC) firing: (2) Blazegraph instance wdqs1004:9100 is entering a GC death spiral - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphJvmQuakeWarnGC [16:59:54] (03PS1) 10Zabe: Start writing to cuc_actor in guwwiki and shnwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/779078 (https://phabricator.wikimedia.org/T233004) [17:00:05] ryankemper: Your horoscope predicts another unfortunate Wikidata Query Service weekly deploy deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220411T1700). 
[17:01:15] (03CR) 10Btullis: [C: 03+2] Add a volume for the jaas-ldap configuration for datahub [deployment-charts] - 10https://gerrit.wikimedia.org/r/779077 (https://phabricator.wikimedia.org/T301454) (owner: 10Btullis) [17:03:30] 10SRE, 10ops-codfw, 10Data-Persistence (Consultation): codfw: Dedicate Rack B1 for cloudX-dev servers - https://phabricator.wikimedia.org/T305469 (10akosiaris) [17:05:22] (03Merged) 10jenkins-bot: Add a volume for the jaas-ldap configuration for datahub [deployment-charts] - 10https://gerrit.wikimedia.org/r/779077 (https://phabricator.wikimedia.org/T301454) (owner: 10Btullis) [17:09:04] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on rdb2008.codfw.wmnet with reason: moving to a different rack [17:09:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:06] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on rdb2008.codfw.wmnet with reason: moving to a different rack [17:09:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:11] 10SRE, 10ops-codfw, 10Data-Persistence (Consultation): codfw: Dedicate Rack B1 for cloudX-dev servers - https://phabricator.wikimedia.org/T305469 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=9620461c-f770-40dd-99d6-2b4f895a2549) set by akosiaris@cumin1001 for 2:00:00 on 1 host(s) and t... 
[17:09:15] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubestage2002.codfw.wmnet with reason: moving to a different rack [17:09:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:17] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubestage2002.codfw.wmnet with reason: moving to a different rack [17:09:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:22] 10SRE, 10ops-codfw, 10Data-Persistence (Consultation): codfw: Dedicate Rack B1 for cloudX-dev servers - https://phabricator.wikimedia.org/T305469 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=6e1e84ea-fac8-4dde-be55-1bf6ea935f75) set by akosiaris@cumin1001 for 2:00:00 on 1 host(s) and t... [17:11:55] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc2023.codfw.wmnet with reason: moving to a different rack [17:11:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:58] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc2023.codfw.wmnet with reason: moving to a different rack [17:11:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:03] 10SRE, 10ops-codfw, 10Data-Persistence (Consultation): codfw: Dedicate Rack B1 for cloudX-dev servers - https://phabricator.wikimedia.org/T305469 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=5293ea70-e1a3-4862-ae77-82e8abf9cdd4) set by akosiaris@cumin1001 for 2:00:00 on 1 host(s) and t... 
[17:12:17] 10SRE, 10ops-codfw, 10Data-Persistence (Consultation): codfw: Dedicate Rack B1 for cloudX-dev servers - https://phabricator.wikimedia.org/T305469 (10akosiaris) [17:14:14] 10SRE, 10ops-codfw, 10Data-Persistence (Consultation): codfw: Dedicate Rack B1 for cloudX-dev servers - https://phabricator.wikimedia.org/T305469 (10akosiaris) I marked rdb2008, kubestage2002 and mc2023 as YES in the table. rdb2008 is the secondary, not the primary, kubestage2002 is for the staging cluster a... [17:15:21] (03PS16) 10Herron: prometheus: enable prometheus web access via proxy with IDP [puppet] - 10https://gerrit.wikimedia.org/r/764895 (https://phabricator.wikimedia.org/T301944) [17:15:55] (03CR) 10Herron: prometheus: enable prometheus web access via proxy with IDP (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/764895 (https://phabricator.wikimedia.org/T301944) (owner: 10Herron) [17:17:10] PROBLEM - Host wcqs2001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:17:10] 10SRE, 10SRE-Access-Requests: Requesting access to google console for TomekSikora.Monsoon - https://phabricator.wikimedia.org/T304502 (10SCherukuwada) I've given the above-mentioned e-mail address access to the two English Wikipedia domains (en.wikipedia.org and en.m.wikipedia.org). @Jaime Crespo 10SRE, 10ops-codfw, 10Data-Persistence (Consultation): codfw: Dedicate Rack B1 for cloudX-dev servers - https://phabricator.wikimedia.org/T305469 (10bking) [17:23:27] RECOVERY - Host wcqs2001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 38.20 ms [17:23:36] 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: Q3:(Need By: TBD) rack/setup/install dumpsdata100[67] - https://phabricator.wikimedia.org/T299443 (10RobH) p:05High→03Unbreak! >>! In T299443#7841687, @cmooney wrote: > FYI I believe PXE is failing for dumpsdata1006 as the DAC cable is plugged into the... 
[17:24:08] !log powerdown kubestage2002 for relocation [17:24:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:48] 10SRE, 10ops-codfw, 10Data-Persistence (Consultation): codfw: Dedicate Rack B1 for cloudX-dev servers - https://phabricator.wikimedia.org/T305469 (10Papaul) [17:26:57] 10SRE, 10ops-codfw, 10Data-Persistence (Consultation): codfw: Dedicate Rack B1 for cloudX-dev servers - https://phabricator.wikimedia.org/T305469 (10Papaul) [17:27:36] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [17:27:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:28:14] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [17:28:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:59] 10SRE, 10ops-codfw, 10Data-Persistence (Consultation): codfw: Dedicate Rack B1 for cloudX-dev servers - https://phabricator.wikimedia.org/T305469 (10Papaul) I cannot power down kubestage2002 ` W: aborting poweroff due to 30-query-hostname exiting with code 1. 
[17:31:41] !log powerdown rdb2008 for relocation [17:31:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:02] 10SRE, 10ops-codfw, 10Data-Persistence (Consultation): codfw: Dedicate Rack B1 for cloudX-dev servers - https://phabricator.wikimedia.org/T305469 (10Papaul) [17:34:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T297189)', diff saved to https://phabricator.wikimedia.org/P24442 and previous config saved to /var/cache/conftool/dbconfig/20220411-173423-marostegui.json [17:34:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:27] T297189: Schema change for dropping ft_title and ft_namespace - https://phabricator.wikimedia.org/T297189 [17:35:13] 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: Q3:(Need By: TBD) rack/setup/install dumpsdata100[67] - https://phabricator.wikimedia.org/T299443 (10RobH) p:05Unbreak!→03Medium I worked around the issue via idrac and piping output to a text file to make up for the idrac serial screen issue of not get... 
[17:37:34] PROBLEM - Host rdb2008.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:37:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T298565)', diff saved to https://phabricator.wikimedia.org/P24443 and previous config saved to /var/cache/conftool/dbconfig/20220411-173735-ladsgroup.json [17:37:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:37:39] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [17:37:43] 10Puppet, 10SRE, 10Infrastructure-Foundations: Validate all yaml files in puppet.git - https://phabricator.wikimedia.org/T305676 (10Dzahn) The Debian package [[[ https://packages.debian.org/bullseye/yamllint | yamllint ]] exists in bullseye nowadays and works. examples: ` /puppet/hieradata$ yamllint cloud... [17:37:58] RECOVERY - Host rdb2008.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.65 ms [17:38:44] (03CR) 10Ebernhardson: [C: 03+1] "Seems reasonable, verified functionality is also in 6.5." 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/776999 (https://phabricator.wikimedia.org/T304570) (owner: 10Bking) [17:41:01] 10SRE, 10ops-codfw, 10Data-Persistence (Consultation): codfw: Dedicate Rack B1 for cloudX-dev servers - https://phabricator.wikimedia.org/T305469 (10Papaul) [17:42:18] 10SRE, 10ops-codfw, 10Data-Persistence (Consultation): codfw: Dedicate Rack B1 for cloudX-dev servers - https://phabricator.wikimedia.org/T305469 (10Papaul) [17:43:01] 10SRE, 10ops-codfw, 10Data-Persistence (Consultation): codfw: Dedicate Rack B1 for cloudX-dev servers - https://phabricator.wikimedia.org/T305469 (10Papaul) [17:43:28] (03CR) 10Ebernhardson: [C: 03+1] elastic: allow waiting for yellow instead of green [cookbooks] - 10https://gerrit.wikimedia.org/r/778335 (https://phabricator.wikimedia.org/T304570) (owner: 10Ryan Kemper) [17:45:09] 10SRE, 10ops-codfw, 10Data-Persistence (Consultation): codfw: Dedicate Rack B1 for cloudX-dev servers - https://phabricator.wikimedia.org/T305469 (10Papaul) [17:47:26] 10SRE, 10ops-codfw, 10Data-Persistence (Consultation): codfw: Dedicate Rack B1 for cloudX-dev servers - https://phabricator.wikimedia.org/T305469 (10Papaul) @hnowlan will it be possible to get me restbase2021 offline on April 14th at 9:30am CT? thanks. [17:48:24] PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on alert1001 is CRITICAL: 45.14 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [17:49:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P24444 and previous config saved to /var/cache/conftool/dbconfig/20220411-174928-marostegui.json [17:49:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:49:33] (03CR) 10Volans: [C: 03+1] "No blocker for me, but I have no context on the ES side of thing." 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/776999 (https://phabricator.wikimedia.org/T304570) (owner: 10Bking) [17:52:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P24445 and previous config saved to /var/cache/conftool/dbconfig/20220411-175240-ladsgroup.json [17:52:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:25] (03CR) 10Krinkle: [C: 04-1] "This is not intended as a global variable. Same as the other change, it's named after the directory. Feel free to name it $configDir thoug" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/778667 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [18:04:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P24446 and previous config saved to /var/cache/conftool/dbconfig/20220411-180433-marostegui.json [18:04:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P24447 and previous config saved to /var/cache/conftool/dbconfig/20220411-180745-ladsgroup.json [18:07:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:22] (03CR) 10Andrew Bogott: [C: 03+2] hieradata: switch eqiad1 to use the new enc server [puppet] - 10https://gerrit.wikimedia.org/r/779049 (https://phabricator.wikimedia.org/T295247) (owner: 10Majavah) [18:14:28] 10SRE, 10DC-Ops: Confirm support of PERC 750 raid controller - https://phabricator.wikimedia.org/T297913 (10RobH) Dell suggested some alternate arguments for the command line utility that didn't work, and then requested we open a case for them to escalate Service Request 1090168698 Sent case # to our team... 
[18:15:37] (03PS1) 10Thiemo Kreuz (WMDE): Temporarily undeprecate EditPage::$textbox2 [core] (wmf/1.39.0-wmf.6) - 10https://gerrit.wikimedia.org/r/778641 (https://phabricator.wikimedia.org/T305028) [18:15:56] RECOVERY - Varnish traffic drop between 30min ago and now at ulsfo on alert1001 is OK: (C)60 le (W)70 le 70.34 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [18:18:22] 10SRE, 10SRE-Access-Requests: Requesting access to google console for TomekSikora.Monsoon - https://phabricator.wikimedia.org/T304502 (10KFrancis) Hi all, reconfirming as there is an MSA on file, we are covered. Thanks! [18:19:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T297189)', diff saved to https://phabricator.wikimedia.org/P24448 and previous config saved to /var/cache/conftool/dbconfig/20220411-181939-marostegui.json [18:19:41] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1181.eqiad.wmnet with reason: Maintenance [18:19:42] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1181.eqiad.wmnet with reason: Maintenance [18:19:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:45] T297189: Schema change for dropping ft_title and ft_namespace - https://phabricator.wikimedia.org/T297189 [18:19:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1181 (T297189)', diff saved to https://phabricator.wikimedia.org/P24449 and previous config saved to /var/cache/conftool/dbconfig/20220411-181947-marostegui.json [18:19:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:20] (03PS1) 10Herron: kafka-mirror: startup 
after kafka.service, shutdown before kafka.service [puppet] - 10https://gerrit.wikimedia.org/r/779086 (https://phabricator.wikimedia.org/T305652) [18:22:20] (03PS1) 10Jdlrobson: Enable sticky header edit button in beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/779087 (https://phabricator.wikimedia.org/T304072) [18:22:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T298565)', diff saved to https://phabricator.wikimedia.org/P24450 and previous config saved to /var/cache/conftool/dbconfig/20220411-182250-ladsgroup.json [18:22:52] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1184.eqiad.wmnet with reason: Maintenance [18:22:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:22:53] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1184.eqiad.wmnet with reason: Maintenance [18:22:54] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [18:22:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:22:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:22:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1184 (T298565)', diff saved to https://phabricator.wikimedia.org/P24451 and previous config saved to /var/cache/conftool/dbconfig/20220411-182258-ladsgroup.json [18:23:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:26:09] (03PS2) 10Zabe: Migrate $wmfConfigDir to $wmgConfigDir [mediawiki-config] - 10https://gerrit.wikimedia.org/r/778667 (https://phabricator.wikimedia.org/T45956) [18:26:20] !log gitlab-runners: pausing runner-1011 in gitlab UI from accepting 
new jobs, then deleting instance in Horizon UI to replace it with another bullseye instance T297659 [18:26:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:26:23] T297659: upgrade gitlab-runners to bullseye - https://phabricator.wikimedia.org/T297659 [18:26:36] (03PS3) 10Zabe: Migrate $wmfConfigDir to $configDir [mediawiki-config] - 10https://gerrit.wikimedia.org/r/778667 (https://phabricator.wikimedia.org/T45956) [18:26:50] (03CR) 10Zabe: Migrate $wmfConfigDir to $configDir (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/778667 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [18:27:02] (03Abandoned) 10Herron: sre.kafka.reboot-workers: add --skip-mirrormaker option [cookbooks] - 10https://gerrit.wikimedia.org/r/778325 (https://phabricator.wikimedia.org/T305652) (owner: 10Herron) [18:34:04] (03CR) 10Andrew Bogott: [C: 03+2] "This is excellent cleanup -- thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/778551 (owner: 10Majavah) [18:34:40] (03CR) 10Andrew Bogott: [C: 03+2] hieradata: switch to ldap-rw naming on ldap hosts [puppet] - 10https://gerrit.wikimedia.org/r/779071 (https://phabricator.wikimedia.org/T295150) (owner: 10Majavah) [18:35:39] 10SRE, 10SRE-Access-Requests: Requesting access to google console for TomekSikora.Monsoon - https://phabricator.wikimedia.org/T304502 (10jcrespo) Thank you, waiting for Tomek Sikora to confirm access to resolve. [18:40:03] (03PS1) 10Bartosz Dziewoński: Enable edit links in Vector sticky header on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/779091 (https://phabricator.wikimedia.org/T305878) [18:40:20] (03PS4) 10Andrew Bogott: openstack: remove horizon access to puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/778551 (owner: 10Majavah) [18:42:57] would anyone like to merge a beta cluster config change for me, or should i schedule it for a backport window? 
https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/779091 [18:43:55] jouncebot: nowandnext [18:43:55] No deployments scheduled for the next 1 hour(s) and 16 minute(s) [18:43:55] In 1 hour(s) and 16 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220411T2000) [18:44:00] MatmaRex: looking [18:44:17] (03CR) 10Andrew Bogott: [C: 04-1] "I see some references to PUPPETMASTER_API and PUPPET_TABLE_MODE in the horizon code, so that needs cleaning up before we can merge this. l" [puppet] - 10https://gerrit.wikimedia.org/r/778551 (owner: 10Majavah) [18:44:48] (03CR) 10Majavah: [C: 03+2] Enable edit links in Vector sticky header on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/779091 (https://phabricator.wikimedia.org/T305878) (owner: 10Bartosz Dziewoński) [18:45:26] (03Merged) 10jenkins-bot: Enable edit links in Vector sticky header on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/779091 (https://phabricator.wikimedia.org/T305878) (owner: 10Bartosz Dziewoński) [18:45:48] (03CR) 10Andrew Bogott: [C: 04-1] "ok, I imagine that's in https://gerrit.wikimedia.org/r/c/openstack/horizon/wmf-puppet-dashboard/+/778616 which I haven't read yet" [puppet] - 10https://gerrit.wikimedia.org/r/778551 (owner: 10Majavah) [18:46:14] MatmaRex: pulled to deploy1002 but not syncing since it only touches a -labs.php file, it should make its way to beta within the next 30 mins or so [18:46:30] thanks taavi! 
[18:48:34] (03CR) 10Majavah: openstack: remove horizon access to puppetmaster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/778551 (owner: 10Majavah) [18:49:28] 10SRE, 10Performance-Team, 10Traffic: Enable HTTP compression for arclamp trace logs - https://phabricator.wikimedia.org/T305783 (10Krinkle) p:05Triage→03Medium a:03dpifke [18:52:48] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [18:52:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:52:51] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [18:52:52] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [18:52:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:52:56] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [18:53:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:56:17] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4:(Need By: TBD) rack/setup/install cloudweb100[34] - https://phabricator.wikimedia.org/T305414 (10Andrew) FYI, @ayounsi, our mid-term goal is to eliminate the need for this hardware entirely. - Wikitech needs to move to the mediawiki cluste... 
[18:56:50] (03Abandoned) 10Jdlrobson: Enable sticky header edit button in beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/779087 (https://phabricator.wikimedia.org/T304072) (owner: 10Jdlrobson) [18:59:51] 10SRE, 10LDAP-Access-Requests: Grant Access to for - https://phabricator.wikimedia.org/T297652 (10Zabe) [19:00:02] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for Bernard Wang - https://phabricator.wikimedia.org/T279014 (10Zabe) [19:00:07] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for Clare Ming - https://phabricator.wikimedia.org/T278265 (10Zabe) [19:00:16] 10SRE, 10LDAP-Access-Requests: LDAP access for Till Mletzko - https://phabricator.wikimedia.org/T267744 (10Zabe) [19:00:27] !log volans@cumin1001 START - Cookbook sre.hosts.downtime for 0:10:00 on sretest[1001-1002].eqiad.wmnet with reason: testing spicerack [19:00:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:29] !log volans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on sretest[1001-1002].eqiad.wmnet with reason: testing spicerack [19:00:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:02:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T297189)', diff saved to https://phabricator.wikimedia.org/P24452 and previous config saved to /var/cache/conftool/dbconfig/20220411-190257-marostegui.json [19:03:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:03:02] T297189: Schema change for dropping ft_title and ft_namespace - https://phabricator.wikimedia.org/T297189 [19:07:04] PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-tails-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:08:44] !log volans@cumin1001 START - Cookbook sre.hosts.downtime for 0:05:00 on 
cumin2002.codfw.wmnet,cumin1001.eqiad.wmnet with reason: testing spicerack [19:08:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:46] !log volans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:05:00 on cumin2002.codfw.wmnet,cumin1001.eqiad.wmnet with reason: testing spicerack [19:08:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:20] PROBLEM - Host rdb2008 is DOWN: PING CRITICAL - Packet loss = 100% [19:09:43] !log gitlab - deleting runner-1011, creating new runner runner-1022 using bullseye [19:09:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:11:56] PROBLEM - SSH on mw2258.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:12:46] ACKNOWLEDGEMENT - SSH on mw2258.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds daniel_zahn https://phabricator.wikimedia.org/T283582 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:17:16] !log runner-1022.gitlab-runners - rm -rf /var/lib/puppet/ssl ; run puppet; sign new request on gitlab-runners-puppetmaster-01.gitlab-runners (normal procedure needed when creating fresh instance in project with local puppetmaster) T297659 [19:17:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:20] T297659: upgrade gitlab-runners to bullseye - https://phabricator.wikimedia.org/T297659 [19:17:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T298565)', diff saved to https://phabricator.wikimedia.org/P24453 and previous config saved to /var/cache/conftool/dbconfig/20220411-191738-ladsgroup.json [19:17:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:41] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, 
user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [19:17:56] PROBLEM - SSH on wtp1035.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:18:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P24454 and previous config saved to /var/cache/conftool/dbconfig/20220411-191802-marostegui.json [19:18:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:38] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast, AS64605/IPv6: Active - Anycast, AS64605/IPv6: Active - Anycast, AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_ [19:26:50] (03PS1) 10Majavah: hieradata: fix value type for devtools [puppet] - 10https://gerrit.wikimedia.org/r/779095 [19:28:08] (03PS1) 10Ottomata: Add gmodena to analytics-research-admins for airflow access [puppet] - 10https://gerrit.wikimedia.org/r/779096 (https://phabricator.wikimedia.org/T305880) [19:29:39] (03CR) 10Ottomata: [C: 03+2] Add gmodena to analytics-research-admins for airflow access [puppet] - 10https://gerrit.wikimedia.org/r/779096 (https://phabricator.wikimedia.org/T305880) (owner: 10Ottomata) [19:31:39] (03PS3) 10Majavah: P:openldap: remove 'labs' branding [puppet] - 10https://gerrit.wikimedia.org/r/776191 (https://phabricator.wikimedia.org/T295150) [19:32:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P24456 and previous 
config saved to /var/cache/conftool/dbconfig/20220411-193243-ladsgroup.json [19:32:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:32:55] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34778/console" [puppet] - 10https://gerrit.wikimedia.org/r/776191 (https://phabricator.wikimedia.org/T295150) (owner: 10Majavah) [19:33:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P24457 and previous config saved to /var/cache/conftool/dbconfig/20220411-193307-marostegui.json [19:33:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:42:21] (03PS1) 10Dzahn: gitlab_runner: solve race condition to to make things work on first run [puppet] - 10https://gerrit.wikimedia.org/r/779099 [19:43:33] (03CR) 10jerkins-bot: [V: 04-1] gitlab_runner: solve race condition to to make things work on first run [puppet] - 10https://gerrit.wikimedia.org/r/779099 (owner: 10Dzahn) [19:47:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P24458 and previous config saved to /var/cache/conftool/dbconfig/20220411-194748-ladsgroup.json [19:47:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:48:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T297189)', diff saved to https://phabricator.wikimedia.org/P24459 and previous config saved to /var/cache/conftool/dbconfig/20220411-194812-marostegui.json [19:48:14] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [19:48:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:48:16] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 
on dbstore1003.eqiad.wmnet with reason: Maintenance [19:48:16] T297189: Schema change for dropping ft_title and ft_namespace - https://phabricator.wikimedia.org/T297189 [19:48:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:48:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:48:42] (03PS1) 10Cathal Mooney: Modify homer automation for IPv6 RAs to allow for custom interfaces [homer/public] - 10https://gerrit.wikimedia.org/r/779100 (https://phabricator.wikimedia.org/T299758) [19:49:39] (03CR) 10jerkins-bot: [V: 04-1] Modify homer automation for IPv6 RAs to allow for custom interfaces [homer/public] - 10https://gerrit.wikimedia.org/r/779100 (https://phabricator.wikimedia.org/T299758) (owner: 10Cathal Mooney) [19:52:08] (03PS2) 10Cathal Mooney: Modify homer automation for IPv6 RAs to allow for custom interfaces [homer/public] - 10https://gerrit.wikimedia.org/r/779100 (https://phabricator.wikimedia.org/T299758) [19:52:54] (03CR) 10jerkins-bot: [V: 04-1] Modify homer automation for IPv6 RAs to allow for custom interfaces [homer/public] - 10https://gerrit.wikimedia.org/r/779100 (https://phabricator.wikimedia.org/T299758) (owner: 10Cathal Mooney) [19:53:40] (03PS2) 10Dzahn: gitlab_runner: solve race condition to to make things work on first run [puppet] - 10https://gerrit.wikimedia.org/r/779099 [19:55:12] (03CR) 10Bking: elastic: don't wait for green on first node (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/776999 (https://phabricator.wikimedia.org/T304570) (owner: 10Bking) [19:56:15] (03PS3) 10Cathal Mooney: Modify homer automation for IPv6 RAs to allow for custom interfaces [homer/public] - 10https://gerrit.wikimedia.org/r/779100 (https://phabricator.wikimedia.org/T299758) [19:57:37] (03CR) 10Cathal Mooney: [C: 03+2] Modify homer automation for IPv6 RAs to allow for custom interfaces [homer/public] - 10https://gerrit.wikimedia.org/r/779100 
(https://phabricator.wikimedia.org/T299758) (owner: 10Cathal Mooney) [19:58:17] (03Merged) 10jenkins-bot: Modify homer automation for IPv6 RAs to allow for custom interfaces [homer/public] - 10https://gerrit.wikimedia.org/r/779100 (https://phabricator.wikimedia.org/T299758) (owner: 10Cathal Mooney) [20:00:04] RoanKattouw, Urbanecm, and cjming: That opportune time is upon us again. Time for a UTC late backport window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220411T2000). [20:00:04] zabe: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:28] hey zabe [20:00:29] around? [20:00:51] o/ [20:00:52] hey [20:01:41] zabe: ad https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/779078, may i know why those two wikis (why not testwiki instead, for example)? [20:02:02] RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:02:03] (and i also want to double check https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CheckUser/+/773650 isn't needed for that patch to work) [20:02:53] (03CR) 10Urbanecm: [C: 03+2] Migrate $wmfUsingKubernetes to $wmgUsingKubernetes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/776255 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [20:02:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T298565)', diff saved to https://phabricator.wikimedia.org/P24460 and previous config saved to /var/cache/conftool/dbconfig/20220411-200253-ladsgroup.json [20:02:55] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1164.eqiad.wmnet with reason: Maintenance [20:02:56] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on 
db1164.eqiad.wmnet with reason: Maintenance [20:02:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:02:59] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [20:03:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:03:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1164 (T298565)', diff saved to https://phabricator.wikimedia.org/P24461 and previous config saved to /var/cache/conftool/dbconfig/20220411-200301-ladsgroup.json [20:03:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:03:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:04:03] (03Merged) 10jenkins-bot: Migrate $wmfUsingKubernetes to $wmgUsingKubernetes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/776255 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [20:04:18] urbanecm, very pragmatic. Those two are the only ones with the new column. Hope thats fine? [20:04:30] zabe: oh, i thought we added it to all wikis :) [20:04:47] sure, that's good enough [20:05:28] the dba task is open. The only reason these two have the column, is that they are new (created after the db change got merged). [20:06:22] (03CR) 10Urbanecm: [C: 03+1] "verified those two wikis have the new column (while others don't)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/779078 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [20:06:34] zabe: ah, makes sense. 
so we're testing early, basically [20:06:36] awards IRC barnstar to Zabe for working on a ticket from 2013 [20:06:42] * urbanecm awards a second one [20:07:08] zabe: `Migrate $wmfUsingKubernetes to $wmgUsingKubernetes` is now at mwdebug1001 if you can take a look? [20:07:24] "why are all the variables named after the foundation" [20:07:57] zabe: also, if you've some time after the deployments, I can have a look at T305014 too. fine if not, we can do it later. [20:07:58] T305014: Run PopulateCentralId on metawiki - https://phabricator.wikimedia.org/T305014 [20:08:27] :) [20:08:36] urbanecm, that would be cool, I have time [20:08:46] okay, let's do the deployments and then the script :) [20:08:48] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:08:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:51] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:08:52] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:08:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:56] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:08:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:59] let me know how the kubernetes patch is doing [20:09:39] urbanecm, lgtm [20:09:44] syncing [20:11:11] !log urbanecm@deploy1002 Synchronized wmf-config/: d4ff32f: Migrate $wmfUsingKubernetes to $wmgUsingKubernetes (T45956) (duration: 00m 53s) [20:11:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:11:15] T45956: Rename $wmf* to $wmg* in wmf-config - https://phabricator.wikimedia.org/T45956 [20:11:16] and, it's live [20:11:21] (03PS3) 10Urbanecm: Stop writing to $wmfUsingKubernetes [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/776256 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [20:11:39] (03CR) 10Urbanecm: [C: 03+2] Stop writing to $wmfUsingKubernetes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/776256 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [20:12:21] (03Merged) 10jenkins-bot: Stop writing to $wmfUsingKubernetes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/776256 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [20:12:31] (03PS1) 10Cathal Mooney: Remove IPv6 RA config on cr2-drmrs fxp0.0 [homer/public] - 10https://gerrit.wikimedia.org/r/779101 (https://phabricator.wikimedia.org/T299758) [20:12:53] zabe: pulled to mwdebug1001, but i doubt it's testable [20:13:31] (03CR) 10Cathal Mooney: [C: 03+2] Remove IPv6 RA config on cr2-drmrs fxp0.0 [homer/public] - 10https://gerrit.wikimedia.org/r/779101 (https://phabricator.wikimedia.org/T299758) (owner: 10Cathal Mooney) [20:13:59] urbanecm, yeah, I can confirm that it doesn't let the site explode, I don't think either that I can test more [20:14:09] in that case, syncing :) [20:14:25] (03PS2) 10Urbanecm: Start writing to cuc_actor in guwwiki and shnwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/779078 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [20:14:32] (03Merged) 10jenkins-bot: Remove IPv6 RA config on cr2-drmrs fxp0.0 [homer/public] - 10https://gerrit.wikimedia.org/r/779101 (https://phabricator.wikimedia.org/T299758) (owner: 10Cathal Mooney) [20:15:31] !log urbanecm@deploy1002 Synchronized wmf-config/CommonSettings.php: 8455fa0: Stop writing to $wmfUsingKubernetes (T45956) (duration: 00m 51s) [20:15:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:16:14] (03CR) 10Urbanecm: [C: 03+2] Start writing to cuc_actor in guwwiki and shnwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/779078 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [20:17:07] (03Merged) 
10jenkins-bot: Start writing to cuc_actor in guwwiki and shnwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/779078 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [20:17:16] (03PS3) 10Dzahn: gitlab_runner: solve race condition to to make things work on first run [puppet] - 10https://gerrit.wikimedia.org/r/779099 [20:17:46] zabe: pulled to mwdebug1001. i guess i'll need to help with this one, right? [20:17:56] (03CR) 10Dzahn: [C: 03+2] "https://phabricator.wikimedia.org/P24455" [puppet] - 10https://gerrit.wikimedia.org/r/779099 (owner: 10Dzahn) [20:18:26] zabe: i think that one can be tested by making an edit, checking the table and checking the CU interface, is that right? [20:19:04] RECOVERY - SSH on wtp1035.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:19:04] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:19:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:07] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:19:08] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:19:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:12] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:19:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:33] urbanecm, I did https://guw.wikipedia.org/w/index.php?title=Zinzant%E1%BB%8D:Zabe/Test&oldid=19230 . There should be an entry in cu_changes for that I guess. Could you check whether the actor id is correct? [20:20:10] cuc_user: 3 [20:20:10] cuc_user_text: Zabe [20:20:10] cuc_actor: 3 [20:20:13] sounds about right [20:22:17] urbanecm, same for shnwikivoyage. 
I guess if that looks good we can sync it, I will keep an eye on logstash, to make sure no fatals occur? [20:22:25] sounds good to me [20:22:49] cuc_user: 2 [20:22:49] cuc_user_text: Zabe [20:22:49] cuc_actor: 2 [20:22:52] this is shnwikivoyage [20:22:56] also looks correct [20:22:59] zabe: so, sync? [20:23:12] I would say :) [20:23:25] doing :) [20:24:14] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:24:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:24:17] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:24:18] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:24:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:24:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:24:22] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:24:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:24:40] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 17c8c17: Start writing to cuc_actor in guwwiki and shnwikivoyage (T233004) (duration: 00m 51s) [20:24:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:24:44] T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004 [20:24:56] zabe: and, it's live [20:25:03] so i guess it's time for the script :)) [20:25:13] yes :) [20:26:15] zabe: do you have a guess about how quick it should be? [20:27:49] tbh, not really. The largest id is ~3.000.000 and there are currently ~200.000 entries, so it shouldn't take /that/ long [20:28:05] maybe an hour? 
[20:28:49] okay [20:28:52] let's hope :) [20:29:09] let me run it for ~100 rows first [20:29:29] like half of the current entries should already have the new column populated [20:29:42] that's cool [20:33:12] (03CR) 10Dzahn: "tested with new instance runner-1023. No more errors on first puppet run, it works right away with a single run now after applying profile" [puppet] - 10https://gerrit.wikimedia.org/r/779099 (owner: 10Dzahn) [20:35:17] (03CR) 10CDanis: [C: 03+2] Exclude WMF cloud IPs from generic cloud limiter [puppet] - 10https://gerrit.wikimedia.org/r/779074 (owner: 10BBlack) [20:35:45] zabe: I'm trying to get the script to write something, and i'm failing to. i ran `mwscript extensions/GlobalBlocking/maintenance/PopulateCentralId.php --wiki=metawiki --batch-size=100` with a break after the first batch (to be able to verify). it said `Completed migration, updated 1 row(s), migration failed for 0 row(s).` [20:35:59] but...there are no blocks with gb_id <= 100 [20:36:29] yeah, there are no blocks with gb_id <= 100, because expired global blocks get purged from the db [20:36:42] but why does it say it updated 1 row? 
[20:38:21] (03PS2) 10Phedenskog: grafana: double-proxy for performance JSON meta data [puppet] - 10https://gerrit.wikimedia.org/r/778469 (https://phabricator.wikimedia.org/T304583) [20:38:41] ehm [20:38:42] zabe: i tried higher batch sizes too (enough to hit the lowest block ID of 4157), and while it still says it updated 1 row, it does run the update [20:39:08] (03PS3) 10Phedenskog: grafana: double-proxy for performance JSON meta data [puppet] - 10https://gerrit.wikimedia.org/r/778469 (https://phabricator.wikimedia.org/T304583) [20:39:34] ok, actually I remember the update count to be wrong on beta as well, it always said 70, while there are only like ~10 global blocks in beta [20:39:42] interesting [20:39:55] (03CR) 10Phedenskog: grafana: double-proxy for performance JSON meta data (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/778469 (https://phabricator.wikimedia.org/T304583) (owner: 10Phedenskog) [20:40:26] https://phabricator.wikimedia.org/P24462 is the updates i have [20:40:35] they look good to me [20:41:05] yep, value is correct [20:42:23] zabe: at least :). I'll run it in full then (unless you have any objections, of course). 
[20:42:40] no objections from me :) [20:42:56] running [20:43:34] i'm curious about the update count though (not that it's the most important part, it's really just curiosity) [20:45:19] !log [urbanecm@mwmaint1002 ~]$ mwscript extensions/GlobalBlocking/maintenance/PopulateCentralId.php --wiki=metawiki # START, T305014, running in a tmux under my account at mwmaint1002 [20:45:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:45:22] T305014: Run PopulateCentralId on metawiki - https://phabricator.wikimedia.org/T305014 [20:46:22] urbanecm: no no [20:46:29] the new column is not in production yet [20:46:33] zabe: ^ [20:46:33] Amir1: it is [20:46:56] Amir1: see https://phabricator.wikimedia.org/P24462 [20:47:10] (script stopped) [20:47:16] I'm talking about cuc_actor [20:47:44] Amir1, the wikis are new and got created after the db patch [20:47:47] https://phabricator.wikimedia.org/T303603 [20:47:53] i see it there as well https://www.irccloud.com/pastebin/Ft5v42is/ [20:48:04] zabe: aaah [20:48:11] that's smart [20:48:20] okay then [20:48:22] but if you prefer to have the column unused until it's everywhere, i can revert the patch, no problem [20:48:24] ;) [20:48:33] * urbanecm was confused at first too [20:48:45] urbanecm: nah, it's fine. As long as it doesn't break the wiki [20:49:01] okay :). we tested that, fortunately. [20:49:10] ok to restart the PopulateCentralId script too? 
[20:49:13] ofc [20:49:28] thanks [20:49:50] I now 'abuse' those two wikis as testing environment, since there is no checkuser on beta ¯\_(ツ)_/¯ [20:50:38] PopulateCentralId restarted [20:58:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1164 (T298565)', diff saved to https://phabricator.wikimedia.org/P24463 and previous config saved to /var/cache/conftool/dbconfig/20220411-205844-ladsgroup.json [20:58:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:58:50] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [20:59:16] (BlazegraphJvmQuakeWarnGC) firing: (2) Blazegraph instance wdqs1004:9100 is entering a GC death spiral - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphJvmQuakeWarnGC [21:00:04] Reedy, sbassett, Maryum, and manfredi: My dear minions, it's time we take the moon! Just kidding. Time for Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220411T2100). [21:02:38] !log [urbanecm@mwmaint1002 ~]$ mwscript extensions/GlobalBlocking/maintenance/PopulateCentralId.php --wiki=metawiki # END, T305014 [21:02:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:02:41] T305014: Run PopulateCentralId on metawiki - https://phabricator.wikimedia.org/T305014 [21:02:44] zabe: it finished. was quicker than i expected [21:02:54] ah [21:02:55] nice [21:03:15] zabe: do you need/want the script's output? or is the new DB content good enough? 
[21:03:56] maybe you could paste the output, but more importantly could double check that entries with gb_by_central_id = null are left? [21:04:08] * that no entries are left [21:04:22] `select gb_id from globalblocks where gb_by_central_id is null order by gb_id limit 1` returns no rows [21:04:40] awesome, thanks for your help :) [21:04:49] no problem [21:05:46] zabe: linked output from https://phabricator.wikimedia.org/T305014#7846543 and resolved the task :). lmk if anything more's necessary here [21:06:59] (03CR) 10Dzahn: [C: 03+1] "lgtm and all for it. especially like that command lines stay exactly the same. the only thing that keeps me from compiling and merging mys" [puppet] - 10https://gerrit.wikimedia.org/r/779040 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [21:07:55] yes :) [21:08:11] (03CR) 10Dzahn: [C: 03+2] hieradata: fix value type for devtools [puppet] - 10https://gerrit.wikimedia.org/r/779095 (owner: 10Majavah) [21:09:43] (03PS1) 10JHathaway: mx: reject email to legacy mailing list domains [puppet] - 10https://gerrit.wikimedia.org/r/779128 (https://phabricator.wikimedia.org/T280472) [21:11:46] (03CR) 10JHathaway: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34783/console" [puppet] - 10https://gerrit.wikimedia.org/r/779128 (https://phabricator.wikimedia.org/T280472) (owner: 10JHathaway) [21:12:18] (03CR) 10Andrew Bogott: [C: 03+1] hieradata: fix value type for devtools [puppet] - 10https://gerrit.wikimedia.org/r/779095 (owner: 10Majavah) [21:13:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1164', diff saved to https://phabricator.wikimedia.org/P24466 and previous config saved to /var/cache/conftool/dbconfig/20220411-211350-ladsgroup.json [21:13:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:14:20] RECOVERY - SSH on mw2258.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:15:38] (03CR) 10Bking: elastic: don't wait for green on first node (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/776999 (https://phabricator.wikimedia.org/T304570) (owner: 10Bking) [21:15:54] (03CR) 10JHathaway: [V: 03+2] "https://puppet-compiler.wmflabs.org/pcc-worker1001/34783/mx2001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/779128 (https://phabricator.wikimedia.org/T280472) (owner: 10JHathaway) [21:17:20] (03CR) 10JHathaway: [V: 03+2 C: 03+2] mx: reject email to legacy mailing list domains [puppet] - 10https://gerrit.wikimedia.org/r/779128 (https://phabricator.wikimedia.org/T280472) (owner: 10JHathaway) [21:20:31] (03CR) 10RLazarus: [C: 03+2] external_clouds_vendors: Support entity types besides "cloud" [puppet] - 10https://gerrit.wikimedia.org/r/777899 (https://phabricator.wikimedia.org/T305581) (owner: 10RLazarus) [21:21:03] jhathaway: okay to merge yours? 
[21:21:09] yup, thanks [21:21:25] done [21:21:30] thanks [21:28:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1164', diff saved to https://phabricator.wikimedia.org/P24467 and previous config saved to /var/cache/conftool/dbconfig/20220411-212855-ladsgroup.json [21:28:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:33:31] (03CR) 10Dzahn: "compiled, ready to merge this but I would like someone around to confirm everything is working as expected after this major version change" [puppet] - 10https://gerrit.wikimedia.org/r/768774 (https://phabricator.wikimedia.org/T300682) (owner: 10Dduvall) [21:42:29] (03CR) 10Andrew Bogott: [C: 03+2] openstack: remove horizon access to puppetmaster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/778551 (owner: 10Majavah) [21:44:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1164 (T298565)', diff saved to https://phabricator.wikimedia.org/P24468 and previous config saved to /var/cache/conftool/dbconfig/20220411-214400-ladsgroup.json [21:44:02] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1169.eqiad.wmnet with reason: Maintenance [21:44:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:44:03] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1169.eqiad.wmnet with reason: Maintenance [21:44:04] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [21:44:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:44:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1169 (T298565)', diff saved to https://phabricator.wikimedia.org/P24469 
and previous config saved to /var/cache/conftool/dbconfig/20220411-214408-ladsgroup.json [21:44:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:44:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:52:02] PROBLEM - Host mw1334 is DOWN: PING CRITICAL - Packet loss = 100% [21:54:22] RECOVERY - Host mw1334 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [22:01:54] 10SRE, 10ConfirmEdit (CAPTCHA extension), 10MediaWiki-extensions-CentralAuth, 10Platform Engineering, and 6 others: Allow Stewards to enable 'emergency CAPTCHAs' for anonymous IP edits - https://phabricator.wikimedia.org/T303433 (10Zabe) [22:04:08] (03PS14) 10Bking: elastic: don't wait for green on first node [software/spicerack] - 10https://gerrit.wikimedia.org/r/776999 (https://phabricator.wikimedia.org/T304570) [22:29:03] 10SRE, 10SRE-OnFire, 10observability: Internationalization (i18n) & localization (l10n) of www.wikimediastatus.net - https://phabricator.wikimedia.org/T305896 (10CDanis) [22:35:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T298565)', diff saved to https://phabricator.wikimedia.org/P24470 and previous config saved to /var/cache/conftool/dbconfig/20220411-223530-ladsgroup.json [22:35:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:35:37] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [22:50:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P24471 and previous config saved to /var/cache/conftool/dbconfig/20220411-225035-ladsgroup.json [22:50:37] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:05:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P24472 and previous config saved to /var/cache/conftool/dbconfig/20220411-230540-ladsgroup.json [23:05:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:09:01] 10SRE, 10ops-codfw: Dell switches testing - https://phabricator.wikimedia.org/T290133 (10Papaul) p:05Triage→03Medium [23:12:23] 10SRE, 10ops-codfw, 10Discovery: elastic2033 without bootable devices available (repeat of T281621) - https://phabricator.wikimedia.org/T305646 (10Papaul) p:05Triage→03Medium a:03Papaul [23:20:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T298565)', diff saved to https://phabricator.wikimedia.org/P24473 and previous config saved to /var/cache/conftool/dbconfig/20220411-232045-ladsgroup.json [23:20:47] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1106.eqiad.wmnet with reason: Maintenance [23:20:49] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1106.eqiad.wmnet with reason: Maintenance [23:20:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:20:50] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [23:20:52] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [23:20:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:20:54] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 
clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [23:20:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:20:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:21:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:21:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1106 (T298565)', diff saved to https://phabricator.wikimedia.org/P24474 and previous config saved to /var/cache/conftool/dbconfig/20220411-232102-ladsgroup.json [23:21:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:33:20] PROBLEM - SSH on labweb1002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:38:34] (03CR) 10Krinkle: [C: 03+1] "Good to go." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/778667 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [23:44:18] RECOVERY - Host elastic2033 is UP: PING OK - Packet loss = 0%, RTA = 33.87 ms [23:47:56] 10SRE, 10ops-codfw, 10Discovery: elastic2033 without bootable devices available (repeat of T281621) - https://phabricator.wikimedia.org/T305646 (10Papaul) 05Open→03Resolved Boot was set to UEFI for some reason. I changed it back to Legacy BIOS, system is back online [23:49:02] 10SRE, 10ops-codfw, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs2001-aqs2012 - https://phabricator.wikimedia.org/T305568 (10Eevans) [23:49:37] 10SRE, 10ops-eqiad, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (10Eevans)