[00:02:05] PROBLEM - Maps tiles generation on alert1001 is CRITICAL: CRITICAL: 100.00% of data under the critical threshold [5.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/d/000000305/maps-performances?orgId=1&viewPanel=8 [00:02:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P24539 and previous config saved to /var/cache/conftool/dbconfig/20220413-000258-ladsgroup.json [00:03:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:12:01] SRE, ops-codfw, DC-Ops, cloud-services-team (Hardware): Q3:(Need By: TBD) rack/setup/install 7 wmcs hosts - https://phabricator.wikimedia.org/T304881 (Papaul) @nskaggs @Andrew @aborrero @dcaro the goal for codfw is to consolidate all cloudx-dev nodes in a single rack see (T305469) and the racking... [00:18:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T298565)', diff saved to https://phabricator.wikimedia.org/P24540 and previous config saved to /var/cache/conftool/dbconfig/20220413-001803-ladsgroup.json [00:18:05] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1135.eqiad.wmnet with reason: Maintenance [00:18:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:18:07] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1135.eqiad.wmnet with reason: Maintenance [00:18:08] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [00:18:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:18:11] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:18:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1135 (T298565)', diff saved to https://phabricator.wikimedia.org/P24541 and previous config saved to /var/cache/conftool/dbconfig/20220413-001811-ladsgroup.json [00:18:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:21:11] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [00:25:41] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2001 is CRITICAL: instance=10.192.32.33 verb=POST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [00:27:57] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [00:44:36] !log ryankemper@cumin1001 START - Cookbook sre.hosts.reimage for host elastic2033.codfw.wmnet with OS stretch [00:44:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:44:41] SRE, ops-codfw, Discovery: elastic2033 without bootable devices available (repeat of T281621) - https://phabricator.wikimedia.org/T305646 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin1001 for host elastic2033.codfw.wmnet with OS stretch [00:59:44] !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2033.codfw.wmnet with reason: host reimage [00:59:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:03:11] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2033.codfw.wmnet with reason: host reimage [01:03:12] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:12:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T298565)', diff saved to https://phabricator.wikimedia.org/P24542 and previous config saved to /var/cache/conftool/dbconfig/20220413-011204-ladsgroup.json [01:12:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:12:08] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [01:23:29] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:23:40] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2033.codfw.wmnet with OS stretch [01:23:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:26:23] SRE, ops-codfw, Discovery: elastic2033 without bootable devices available (repeat of T281621) - https://phabricator.wikimedia.org/T305646 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin1001 for host elastic2033.codfw.wmnet with OS stretch completed: - elastic2033... 
[01:27:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P24544 and previous config saved to /var/cache/conftool/dbconfig/20220413-012709-ladsgroup.json [01:27:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:38:45] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:42:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P24545 and previous config saved to /var/cache/conftool/dbconfig/20220413-014214-ladsgroup.json [01:42:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:43:45] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:57:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T298565)', diff saved to https://phabricator.wikimedia.org/P24546 and previous config saved to /var/cache/conftool/dbconfig/20220413-015719-ladsgroup.json [01:57:21] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1119.eqiad.wmnet with reason: Maintenance [01:57:22] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1119.eqiad.wmnet with reason: Maintenance [01:57:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:57:24] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, 
user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [01:57:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:57:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:57:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1119 (T298565)', diff saved to https://phabricator.wikimedia.org/P24547 and previous config saved to /var/cache/conftool/dbconfig/20220413-015727-ladsgroup.json [01:57:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:32:54] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [02:36:41] PROBLEM - SSH on wtp1048.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:53:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119 (T298565)', diff saved to https://phabricator.wikimedia.org/P24548 and previous config saved to /var/cache/conftool/dbconfig/20220413-025350-ladsgroup.json [02:53:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:53:55] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [03:01:31] PROBLEM - Persistent high iowait on labstore1006 is CRITICAL: 66.52 ge 10 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore 
https://grafana.wikimedia.org/d/000000568/labstore1004-1005-1006-1007 [03:08:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119', diff saved to https://phabricator.wikimedia.org/P24549 and previous config saved to /var/cache/conftool/dbconfig/20220413-030855-ladsgroup.json [03:08:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:12:53] RECOVERY - Persistent high iowait on labstore1006 is OK: (C)10 ge (W)5 ge 1.459 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/d/000000568/labstore1004-1005-1006-1007 [03:19:01] (BlazegraphJvmQuakeWarnGC) firing: Blazegraph instance wdqs1012:9100 is entering a GC death spiral - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphJvmQuakeWarnGC [03:20:37] PROBLEM - WDQS SPARQL on wdqs1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:24:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119', diff saved to https://phabricator.wikimedia.org/P24550 and previous config saved to /var/cache/conftool/dbconfig/20220413-032400-ladsgroup.json [03:24:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:27:19] RECOVERY - WDQS SPARQL on wdqs1012 is OK: HTTP OK: HTTP/1.1 200 OK - 688 bytes in 1.059 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:34:21] PROBLEM - WDQS SPARQL on wdqs1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:37:53] RECOVERY - SSH on wtp1048.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:38:43] RECOVERY - WDQS SPARQL 
on wdqs1012 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.062 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:39:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119 (T298565)', diff saved to https://phabricator.wikimedia.org/P24551 and previous config saved to /var/cache/conftool/dbconfig/20220413-033906-ladsgroup.json [03:39:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:39:10] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [03:39:12] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2103.codfw.wmnet with reason: Maintenance [03:39:13] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2103.codfw.wmnet with reason: Maintenance [03:39:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:39:15] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 14 hosts with reason: Maintenance [03:39:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:39:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:39:24] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 14 hosts with reason: Maintenance [03:39:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:10:27] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [04:12:45] 
RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [04:27:17] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1099.eqiad.wmnet with reason: Maintenance [04:27:18] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1099.eqiad.wmnet with reason: Maintenance [04:27:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:27:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:27:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1099:3311 (T298565)', diff saved to https://phabricator.wikimedia.org/P24552 and previous config saved to /var/cache/conftool/dbconfig/20220413-042723-ladsgroup.json [04:27:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:27:27] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [04:48:05] (PS1) STran: Enable IP Info instrumentation on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/779579 (https://phabricator.wikimedia.org/T304438) [04:48:58] (PS2) STran: Enable IP Info instrumentation on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/779579 (https://phabricator.wikimedia.org/T304438) [04:56:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2085:3311', diff saved to https://phabricator.wikimedia.org/P24553 and previous config saved to /var/cache/conftool/dbconfig/20220413-045646-root.json [04:56:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:01:18] (PS1) Marostegui: Revert "db1138: Disable notifications" [puppet] - 
https://gerrit.wikimedia.org/r/779114 [05:02:53] PROBLEM - Check systemd state on db2137 is CRITICAL: CRITICAL - degraded: The following units failed: mariadb.service,prometheus-mysqld-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:05:11] SRE, Wikimedia-Mailing-lists: hyperkitty didn't import all wikitech-l messages - https://phabricator.wikimedia.org/T281070 (Legoktm) Open→Resolved a:Legoktm Unfortunately the very old archives (pre-2004) are not in a great shape just because of old Mailman bugs or some other unknown reasons.... [05:09:49] RECOVERY - Check systemd state on db2137 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:10:49] (CR) Marostegui: [C: +2] Revert "db1138: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/779114 (owner: Marostegui) [05:12:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1138 (re)pooling @ 1%: After schema changes', diff saved to https://phabricator.wikimedia.org/P24554 and previous config saved to /var/cache/conftool/dbconfig/20220413-051238-root.json [05:12:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:23:43] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 235, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:24:31] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 44, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:25:31] SRE, LDAP-Access-Requests: Grant Access to ldap/wmf for Nathillard - https://phabricator.wikimedia.org/T305978 (jcrespo) @Dzahn I responded before I had the chance to read your comments. 
I didn't see explicit concerns about me proceeding (just hinting that in some cases they may not be needed). Given th... [05:32:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311 (T298565)', diff saved to https://phabricator.wikimedia.org/P24556 and previous config saved to /var/cache/conftool/dbconfig/20220413-053248-ladsgroup.json [05:32:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:32:53] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [05:35:13] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 236, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:35:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1181 for reboot T306001', diff saved to https://phabricator.wikimedia.org/P24557 and previous config saved to /var/cache/conftool/dbconfig/20220413-053526-root.json [05:35:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:35:30] T306001: Switchover s7 master (db1136 -> db1181) - https://phabricator.wikimedia.org/T306001 [05:35:55] PROBLEM - SSH on aqs1009.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:36:01] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 45, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:44:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1138 (re)pooling @ 1%: After schema changes', diff saved to https://phabricator.wikimedia.org/P24558 and 
previous config saved to /var/cache/conftool/dbconfig/20220413-054422-root.json [05:44:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:44:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1181 (re)pooling @ 1%: After reboot', diff saved to https://phabricator.wikimedia.org/P24559 and previous config saved to /var/cache/conftool/dbconfig/20220413-054443-root.json [05:44:43] (PS1) Jcrespo: admin: Add Nat to the list of privileged ldap users [puppet] - https://gerrit.wikimedia.org/r/779749 (https://phabricator.wikimedia.org/T305978) [05:44:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:46:12] SRE, LDAP-Access-Requests, Patch-For-Review: Grant Access to ldap/wmf for Nathillard - https://phabricator.wikimedia.org/T305978 (jcrespo) p:Triage→High [05:47:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db2130 db2088:3311', diff saved to https://phabricator.wikimedia.org/P24560 and previous config saved to /var/cache/conftool/dbconfig/20220413-054739-root.json [05:47:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:47:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', diff saved to https://phabricator.wikimedia.org/P24561 and previous config saved to /var/cache/conftool/dbconfig/20220413-054753-ladsgroup.json [05:47:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:59:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1138 (re)pooling @ 5%: After schema changes', diff saved to https://phabricator.wikimedia.org/P24562 and previous config saved to /var/cache/conftool/dbconfig/20220413-055925-root.json [05:59:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:59:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1181 (re)pooling @ 5%: After reboot', diff saved to https://phabricator.wikimedia.org/P24563 and previous 
config saved to /var/cache/conftool/dbconfig/20220413-055947-root.json [05:59:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:01:57] SRE, SRE-Access-Requests, Infrastructure-Foundations, serviceops: allow certain users to disable puppet on mwdebug hosts - https://phabricator.wikimedia.org/T305979 (jcrespo) This seems to me like a reasonable requests, although as you point out, the details of how to exactly implement it to make... [06:02:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', diff saved to https://phabricator.wikimedia.org/P24564 and previous config saved to /var/cache/conftool/dbconfig/20220413-060258-ladsgroup.json [06:03:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:14:01] (BlazegraphJvmQuakeWarnGC) firing: (2) Blazegraph instance wdqs1012:9100 is entering a GC death spiral - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphJvmQuakeWarnGC [06:14:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1138 (re)pooling @ 10%: After schema changes', diff saved to https://phabricator.wikimedia.org/P24565 and previous config saved to /var/cache/conftool/dbconfig/20220413-061429-root.json [06:14:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:14:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1181 (re)pooling @ 10%: After reboot', diff saved to https://phabricator.wikimedia.org/P24566 and previous config saved to /var/cache/conftool/dbconfig/20220413-061451-root.json [06:14:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:18:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311 (T298565)', diff saved to https://phabricator.wikimedia.org/P24567 and previous config saved to 
/var/cache/conftool/dbconfig/20220413-061803-ladsgroup.json [06:18:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:18:08] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [06:18:09] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1099.eqiad.wmnet with reason: Maintenance [06:18:11] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1099.eqiad.wmnet with reason: Maintenance [06:18:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:18:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:18:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1099:3311 (T298565)', diff saved to https://phabricator.wikimedia.org/P24568 and previous config saved to /var/cache/conftool/dbconfig/20220413-061815-ladsgroup.json [06:18:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:20:49] (PS1) Marostegui: db2072: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/779750 [06:21:32] (CR) Marostegui: [C: +2] db2072: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/779750 (owner: Marostegui) [06:29:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1138 (re)pooling @ 25%: After schema changes', diff saved to https://phabricator.wikimedia.org/P24569 and previous config saved to /var/cache/conftool/dbconfig/20220413-062933-root.json [06:29:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:29:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1181 (re)pooling @ 25%: After reboot', diff saved to https://phabricator.wikimedia.org/P24570 and previous 
config saved to /var/cache/conftool/dbconfig/20220413-062955-root.json [06:29:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:32:54] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [06:34:37] PROBLEM - SSH on labweb1002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:44:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1138 (re)pooling @ 50%: After schema changes', diff saved to https://phabricator.wikimedia.org/P24571 and previous config saved to /var/cache/conftool/dbconfig/20220413-064437-root.json [06:44:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:44:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1181 (re)pooling @ 50%: After reboot', diff saved to https://phabricator.wikimedia.org/P24572 and previous config saved to /var/cache/conftool/dbconfig/20220413-064459-root.json [06:45:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:59:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1138 (re)pooling @ 75%: After schema changes', diff saved to https://phabricator.wikimedia.org/P24573 and previous config saved to /var/cache/conftool/dbconfig/20220413-065941-root.json [06:59:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:00:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1181 (re)pooling @ 75%: After reboot', diff saved to https://phabricator.wikimedia.org/P24574 and previous config saved to /var/cache/conftool/dbconfig/20220413-070002-root.json [07:00:04] Amir1, awight, Urbanecm, and taavi: #bothumor Q:Why did functions stop calling each other? A:They had arguments. 
Rise for UTC morning backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220413T0700). [07:00:04] kart_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:00:23] * kart_ is here. [07:00:24] o/ [07:00:28] kart_: do you want to self deploy? [07:00:43] taavi: yeah. will self-deploy.. [07:00:43] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 120 probes of 677 (alerts on 90) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [07:01:02] (PS2) KartikMistry: Add SectionTranslation entry points as campaigns [mediawiki-config] - https://gerrit.wikimedia.org/r/778381 (https://phabricator.wikimedia.org/T298029) [07:02:30] (CR) KartikMistry: [C: +2] Add SectionTranslation entry points as campaigns [mediawiki-config] - https://gerrit.wikimedia.org/r/778381 (https://phabricator.wikimedia.org/T298029) (owner: KartikMistry) [07:02:43] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 46 probes of 760 (alerts on 35) - https://atlas.ripe.net/measurements/32390538/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [07:03:16] (Merged) jenkins-bot: Add SectionTranslation entry points as campaigns [mediawiki-config] - https://gerrit.wikimedia.org/r/778381 (https://phabricator.wikimedia.org/T298029) (owner: KartikMistry) [07:07:17] Deploying.. 
[07:08:03] !log kartik@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:778381|Add SectionTranslation entry points as campaigns (T298029)]] (duration: 01m 03s) [07:08:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:08:07] T298029: Enable Content Translation beta feature for a user when accessing a Section Translation entry point on mobile - https://phabricator.wikimedia.org/T298029 [07:10:29] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [07:10:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:10:32] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [07:10:33] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [07:10:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:10:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:10:37] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [07:10:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:11:47] taavi: done. 
[07:14:01] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 31 probes of 760 (alerts on 35) - https://atlas.ripe.net/measurements/32390538/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [07:14:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1138 (re)pooling @ 100%: After schema changes', diff saved to https://phabricator.wikimedia.org/P24575 and previous config saved to /var/cache/conftool/dbconfig/20220413-071445-root.json [07:14:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:15:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1181 (re)pooling @ 100%: After reboot', diff saved to https://phabricator.wikimedia.org/P24576 and previous config saved to /var/cache/conftool/dbconfig/20220413-071506-root.json [07:15:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:15:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311 (T298565)', diff saved to https://phabricator.wikimedia.org/P24577 and previous config saved to /var/cache/conftool/dbconfig/20220413-071524-ladsgroup.json [07:15:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:15:28] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [07:17:51] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 61 probes of 677 (alerts on 90) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [07:30:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', 
diff saved to https://phabricator.wikimedia.org/P24578 and previous config saved to /var/cache/conftool/dbconfig/20220413-073029-ladsgroup.json [07:30:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:31:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db2072 db2085:3311', diff saved to https://phabricator.wikimedia.org/P24579 and previous config saved to /var/cache/conftool/dbconfig/20220413-073119-root.json [07:31:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:35:37] RECOVERY - SSH on labweb1002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:45:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', diff saved to https://phabricator.wikimedia.org/P24580 and previous config saved to /var/cache/conftool/dbconfig/20220413-074534-ladsgroup.json [07:45:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:00:04] dancy and jnuche: Dear deployers, time to do the MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220413T0800). 
[08:00:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311 (T298565)', diff saved to https://phabricator.wikimedia.org/P24581 and previous config saved to /var/cache/conftool/dbconfig/20220413-080040-ladsgroup.json [08:00:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:00:45] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2103.codfw.wmnet with reason: Maintenance [08:00:46] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [08:00:47] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2103.codfw.wmnet with reason: Maintenance [08:00:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:00:48] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 14 hosts with reason: Maintenance [08:00:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:00:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:00:58] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 14 hosts with reason: Maintenance [08:01:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:07:22] 10SRE, 10Infrastructure-Foundations, 10netops: Cannot verify NTP status asw1-b12-drmrs - https://phabricator.wikimedia.org/T305840 (10ayounsi) I had a quick look as well, but didn't make any progress. I tried to bounce NTP with: `lang=diff [edit system] + processes { + ntp disable; + } ! inacti... 
[08:32:03] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to ldap/wmf for Nathillard - https://phabricator.wikimedia.org/T305978 (10dr0ptp4kt) Thanks all. This is all good and well. Thank you for the support and discussion! The access to some of the things around observability and metrics is part of th... [08:39:31] RECOVERY - SSH on aqs1009.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:41:02] !log ayounsi@cumin2002 START - Cookbook sre.network.cf [08:41:02] !log ayounsi@cumin2002 END (PASS) - Cookbook sre.network.cf (exit_code=0) [08:41:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:25] !log ayounsi@cumin2002 START - Cookbook sre.network.cf [08:41:27] !log ayounsi@cumin2002 END (PASS) - Cookbook sre.network.cf (exit_code=0) [08:41:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:56] !log imported scap 4.6.1 to stretch-/buster-/bullseye-wikimedia - T305949 [08:41:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:59] T305949: Deploy Scap version 4.6.1 - https://phabricator.wikimedia.org/T305949 [08:44:01] (BlazegraphJvmQuakeWarnGC) firing: (2) Blazegraph instance wdqs1012:9100 is entering a GC death spiral - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphJvmQuakeWarnGC [08:44:07] !log jayme@deploy1002 Started deploy [restbase/deploy@627f7d7] (dev-cluster): (no justification provided) [08:44:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:46:48] !log jayme@deploy1002 Finished deploy [restbase/deploy@627f7d7] 
(dev-cluster): (no justification provided) (duration: 02m 41s) [08:46:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:47:43] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1119.eqiad.wmnet with reason: Maintenance [08:47:45] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1119.eqiad.wmnet with reason: Maintenance [08:47:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:47:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:47:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1119 (T298565)', diff saved to https://phabricator.wikimedia.org/P24582 and previous config saved to /var/cache/conftool/dbconfig/20220413-084749-ladsgroup.json [08:47:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:47:54] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [08:49:01] (BlazegraphJvmQuakeWarnGC) resolved: (2) Blazegraph instance wdqs1012:9100 is entering a GC death spiral - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphJvmQuakeWarnGC [08:49:54] (03CR) 10Ayounsi: "Had a chat on IRC, that RA for fxp0 seems like a leftover from the factory config or a miss-config when setting up the routers." 
[homer/public] - 10https://gerrit.wikimedia.org/r/779100 (https://phabricator.wikimedia.org/T299758) (owner: 10Cathal Mooney) [08:50:05] (03PS4) 10JMeybohm: Add all members of the ops group to the deployment group [puppet] - 10https://gerrit.wikimedia.org/r/779047 (https://phabricator.wikimedia.org/T305729) [08:50:31] (03PS4) 10JMeybohm: Switch default group for Kubernetes credentials files to deployment [puppet] - 10https://gerrit.wikimedia.org/r/779048 (https://phabricator.wikimedia.org/T305729) [08:52:58] (03PS1) 10DCausse: team-search-platform: remove BlazegraphJvmQuakeWarnGC [alerts] - 10https://gerrit.wikimedia.org/r/779831 (https://phabricator.wikimedia.org/T293862) [09:12:29] !log btullis@deploy1002 helmfile [codfw] START helmfile.d/services/datahub: apply on main [09:12:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:00] !log btullis@deploy1002 helmfile [codfw] DONE helmfile.d/services/datahub: sync on main [09:14:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:18:29] (03CR) 10Ayounsi: [C: 03+1] "easy" [software/spicerack] - 10https://gerrit.wikimedia.org/r/779561 (owner: 10Volans) [09:20:09] (03CR) 10Volans: [C: 03+2] yaml files: fix indentation [software/spicerack] - 10https://gerrit.wikimedia.org/r/779561 (owner: 10Volans) [09:21:28] !log jnuche@deploy1002 Started deploy [restbase/deploy@627f7d7] (dev-cluster): (no justification provided) [09:21:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:22:10] (03CR) 10Ayounsi: WIP move core routers definitions to hiera (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/777347 (https://phabricator.wikimedia.org/T169860) (owner: 10Filippo Giunchedi) [09:24:19] !log jnuche@deploy1002 Finished deploy [restbase/deploy@627f7d7] (dev-cluster): (no justification provided) (duration: 02m 51s) [09:24:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:16] (03Merged) 
10jenkins-bot: yaml files: fix indentation [software/spicerack] - 10https://gerrit.wikimedia.org/r/779561 (owner: 10Volans) [09:33:33] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:37:08] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM. Before merging, got a PCC?" [puppet] - 10https://gerrit.wikimedia.org/r/779474 (https://phabricator.wikimedia.org/T304716) (owner: 10Majavah) [09:42:42] (03PS1) 10Btullis: Ensure that the datahub consumers use TLS where required [deployment-charts] - 10https://gerrit.wikimedia.org/r/779837 (https://phabricator.wikimedia.org/T301454) [09:43:20] !log btullis@deploy1002 helmfile [eqiad] START helmfile.d/services/datahub: apply on main [09:43:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119 (T298565)', diff saved to https://phabricator.wikimedia.org/P24585 and previous config saved to /var/cache/conftool/dbconfig/20220413-094341-ladsgroup.json [09:43:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:45] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [09:44:47] !log btullis@deploy1002 helmfile [eqiad] DONE helmfile.d/services/datahub: sync on main [09:44:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:45:07] (03PS1) 10Alexandros Kosiaris: Revert "zotero: Disable paging" [puppet] - 10https://gerrit.wikimedia.org/r/779118 (https://phabricator.wikimedia.org/T291707) [09:45:29] (03PS2) 10Alexandros Kosiaris: Revert "zotero: Disable paging" [puppet] - 
10https://gerrit.wikimedia.org/r/779118 (https://phabricator.wikimedia.org/T291707) [09:51:51] (03CR) 10Alexandros Kosiaris: [C: 03+2] Revert "zotero: Disable paging" [puppet] - 10https://gerrit.wikimedia.org/r/779118 (https://phabricator.wikimedia.org/T291707) (owner: 10Alexandros Kosiaris) [09:58:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119', diff saved to https://phabricator.wikimedia.org/P24586 and previous config saved to /var/cache/conftool/dbconfig/20220413-095846-ladsgroup.json [09:58:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:03:15] RECOVERY - Host analytics1076 is UP: PING OK - Packet loss = 0%, RTA = 1.43 ms [10:07:40] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34811/console" [puppet] - 10https://gerrit.wikimedia.org/r/779474 (https://phabricator.wikimedia.org/T304716) (owner: 10Majavah) [10:07:49] PROBLEM - puppet last run on analytics1077 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [10:08:16] (03CR) 10Majavah: [V: 03+1] P:toolforge::prometheus: simplify prometheus config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/779474 (https://phabricator.wikimedia.org/T304716) (owner: 10Majavah) [10:09:27] (03CR) 10Btullis: [C: 03+2] Ensure that the datahub consumers use TLS where required [deployment-charts] - 10https://gerrit.wikimedia.org/r/779837 (https://phabricator.wikimedia.org/T301454) (owner: 10Btullis) [10:12:22] (03CR) 10jerkins-bot: [V: 04-1] Ensure that the datahub consumers use TLS where required [deployment-charts] - 10https://gerrit.wikimedia.org/r/779837 (https://phabricator.wikimedia.org/T301454) (owner: 10Btullis) [10:13:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119', diff saved to https://phabricator.wikimedia.org/P24587 and previous config saved to 
/var/cache/conftool/dbconfig/20220413-101351-ladsgroup.json [10:13:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:54] (03PS2) 10Btullis: Ensure that the datahub consumers use TLS where required [deployment-charts] - 10https://gerrit.wikimedia.org/r/779837 (https://phabricator.wikimedia.org/T301454) [10:21:41] (03PS1) 10Btullis: Add an A record for datahub.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/779839 (https://phabricator.wikimedia.org/T303049) [10:28:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119 (T298565)', diff saved to https://phabricator.wikimedia.org/P24588 and previous config saved to /var/cache/conftool/dbconfig/20220413-102856-ladsgroup.json [10:28:58] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1135.eqiad.wmnet with reason: Maintenance [10:28:59] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1135.eqiad.wmnet with reason: Maintenance [10:29:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:02] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [10:29:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1135 (T298565)', diff saved to https://phabricator.wikimedia.org/P24589 and previous config saved to /var/cache/conftool/dbconfig/20220413-102904-ladsgroup.json [10:29:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:30:22] (03PS1) 10Btullis: Add a trafficserver backend mapping rule for 
datahub [puppet] - 10https://gerrit.wikimedia.org/r/779840 (https://phabricator.wikimedia.org/T303049) [10:32:54] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [10:33:54] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:36:02] RECOVERY - puppet last run on analytics1077 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [10:36:06] (03PS2) 10Btullis: Add an A record for datahub.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/779839 (https://phabricator.wikimedia.org/T303049) [10:40:25] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [10:40:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:45] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [10:40:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:50] !log btullis@deploy1002 helmfile [codfw] START helmfile.d/services/datahub: apply on main [10:41:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:18] !log btullis@deploy1002 helmfile [codfw] DONE helmfile.d/services/datahub: sync on main [10:42:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:31] (03PS1) 10Volans: mediawiki: call siteinfo in HTTPS [software/spicerack] - 10https://gerrit.wikimedia.org/r/779841 [10:46:02] !log btullis@deploy1002 helmfile [eqiad] START helmfile.d/services/datahub: apply on main [10:46:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[10:46:20] PROBLEM - SSH on wtp1048.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:46:21] !log btullis@deploy1002 helmfile [eqiad] DONE helmfile.d/services/datahub: sync on main [10:46:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:59] 10SRE, 10SRE-Access-Requests, 10Infrastructure-Foundations, 10serviceops: allow certain users to disable puppet on mwdebug hosts - https://phabricator.wikimedia.org/T305979 (10Volans) > let them run "puppet disable/enable" either directly or with a wrapper around it. (the one used by cumin?). Nobody shoul... [11:21:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T298565)', diff saved to https://phabricator.wikimedia.org/P24590 and previous config saved to /var/cache/conftool/dbconfig/20220413-112140-ladsgroup.json [11:21:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:46] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [11:36:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P24591 and previous config saved to /var/cache/conftool/dbconfig/20220413-113645-ladsgroup.json [11:36:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:51] (03PS1) 10Cathal Mooney: Remove config/var for defining bespoke interfaces for IPv6 RAs [homer/public] - 10https://gerrit.wikimedia.org/r/779844 (https://phabricator.wikimedia.org/T299758) [11:38:04] !log gmodena@deploy1002 Started deploy [airflow-dags/research@b029f10]: (no justification provided) [11:38:06] Logged the message 
at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:11] !log gmodena@deploy1002 Finished deploy [airflow-dags/research@b029f10]: (no justification provided) (duration: 00m 07s) [11:38:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:40] 10SRE, 10SRE-Access-Requests, 10Data-Engineering: Request to add user gmodena to analytics-research-admins group - https://phabricator.wikimedia.org/T305880 (10gmodena) >>! In T305880#7848648, @jcrespo wrote: > @gmodena Did the access work? Hey @jcrespo, I tried a deployment that failed with: ` airflow-dags... [11:40:08] !log Remove IPv6 router-advertisement config for fxp0 management interface on cr1-drmrs. [11:40:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:54] PROBLEM - SSH on aqs1009.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:46:47] 10SRE, 10SRE-Access-Requests, 10Infrastructure-Foundations, 10serviceops: allow certain users to disable puppet on mwdebug hosts - https://phabricator.wikimedia.org/T305979 (10jcrespo) [11:46:56] !log btullis@cumin1001 END (PASS) - Cookbook sre.hadoop.reboot-workers (exit_code=0) for Hadoop analytics cluster [11:46:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:51:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P24592 and previous config saved to /var/cache/conftool/dbconfig/20220413-115151-ladsgroup.json [11:51:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T298565)', diff saved to https://phabricator.wikimedia.org/P24593 and previous config saved to /var/cache/conftool/dbconfig/20220413-120656-ladsgroup.json [12:06:57] !log ladsgroup@cumin1001 START - 
Cookbook sre.hosts.downtime for 6:00:00 on db1134.eqiad.wmnet with reason: Maintenance [12:06:59] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1134.eqiad.wmnet with reason: Maintenance [12:06:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:01] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [12:07:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1134 (T298565)', diff saved to https://phabricator.wikimedia.org/P24594 and previous config saved to /var/cache/conftool/dbconfig/20220413-120704-ladsgroup.json [12:07:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:33] (03PS1) 10Hnowlan: Set production role and add config for restbase2027 [puppet] - 10https://gerrit.wikimedia.org/r/779846 [12:25:45] (03CR) 10Tchanders: [C: 03+1] Enable IP Info instrumentation on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/779579 (https://phabricator.wikimedia.org/T304438) (owner: 10STran) [12:43:08] (03CR) 10Ayounsi: [C: 03+1] Remove config/var for defining bespoke interfaces for IPv6 RAs [homer/public] - 10https://gerrit.wikimedia.org/r/779844 (https://phabricator.wikimedia.org/T299758) (owner: 10Cathal Mooney) [12:44:06] RECOVERY - SSH on aqs1009.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:48:52] (03CR) 10Ayounsi: [C: 03+1] mediawiki: call siteinfo in HTTPS [software/spicerack] - 10https://gerrit.wikimedia.org/r/779841 
(owner: 10Volans) [12:55:39] (03CR) 10Ottomata: "Hmmm, what do you think about using a more generic name for the public URL, rather than one associated with the tech?" [dns] - 10https://gerrit.wikimedia.org/r/779839 (https://phabricator.wikimedia.org/T303049) (owner: 10Btullis) [12:57:40] 10SRE, 10SRE-Access-Requests, 10Data-Engineering: Request to add user gmodena to analytics-research-admins group - https://phabricator.wikimedia.org/T305880 (10Ottomata) 05Open→03Resolved a:03Ottomata The access works though! We'll figure out the deployment issues separately. [12:57:46] (03PS1) 10Volans: setup.py: add missing types for requests [software/homer] - 10https://gerrit.wikimedia.org/r/779849 [12:57:50] (03PS1) 10Volans: capirca: catch also requests exceptions [software/homer] - 10https://gerrit.wikimedia.org/r/779850 [13:00:05] RoanKattouw, Lucas_WMDE, and Urbanecm: Time to snap out of that daydream and deploy UTC afternoon backport window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220413T1300). [13:00:05] zabe and Tchanders: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:00:21] o/ [13:00:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T298565)', diff saved to https://phabricator.wikimedia.org/P24595 and previous config saved to /var/cache/conftool/dbconfig/20220413-130050-ladsgroup.json [13:00:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:56] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [13:04:47] !log installed spicerack v2.4.1 on cumin2002 [13:04:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:09] !log volans@cumin2002 START - Cookbook sre.hosts.downtime for 0:05:00 on sretest[1001-1002].eqiad.wmnet with reason: testing spicerack [13:10:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:13] !log volans@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:05:00 on sretest[1001-1002].eqiad.wmnet with reason: testing spicerack [13:10:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:15] (03PS1) 10Ladsgroup: Set templatelinks migration schema to write both in s4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/779852 (https://phabricator.wikimedia.org/T299421) [13:12:21] jouncebot: nowandnext [13:12:21] For the next 0 hour(s) and 47 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220413T1300) [13:12:21] In 0 hour(s) and 47 minute(s): Maintenance script run (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220413T1400) [13:12:54] I’m in a meeting, sorry [13:12:58] can’t deploy yet [13:13:14] !log otto@deploy1002 Started deploy [airflow-dags/research@b029f10]: (no justification provided) [13:13:16] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:31] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: relforge testing - bking@cumin2002 - T301955 [13:13:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:37] T301955: Upgrade relforge to elasticsearch 6.8.23 - https://phabricator.wikimedia.org/T301955 [13:13:41] !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: relforge testing - bking@cumin2002 - T301955 [13:13:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:44] (03PS3) 10Reedy: Use namespaced GerritExtDistProvider [mediawiki-config] - 10https://gerrit.wikimedia.org/r/774963 [13:13:48] !log otto@deploy1002 Finished deploy [airflow-dags/research@b029f10]: (no justification provided) (duration: 00m 34s) [13:13:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:01] (03CR) 10Reedy: [C: 03+2] "ship it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/774963 (owner: 10Reedy) [13:14:40] I'm outside so my access is a bit limited [13:14:46] Lucas_WMDE: Hi! Will you be deploying later this window? 
(No worries if not - I can reschedule) [13:14:58] !log bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: relforge testing - bking@cumin1001 - T301955 [13:15:00] (03Merged) 10jenkins-bot: Use namespaced GerritExtDistProvider [mediawiki-config] - 10https://gerrit.wikimedia.org/r/774963 (owner: 10Reedy) [13:15:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:05] !log bking@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: relforge testing - bking@cumin1001 - T301955 [13:15:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P24596 and previous config saved to /var/cache/conftool/dbconfig/20220413-131555-ladsgroup.json [13:15:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:24] let me see if I can do it [13:16:31] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: relforge testing - bking@cumin2002 - T301955 [13:16:35] !log reedy@deploy1002 Synchronized wmf-config/CommonSettings.php: Use namespaced GerritExtDistProvider (duration: 00m 55s) [13:16:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:58] Amir1: I'm at home etc [13:16:59] * Reedy looks [13:17:23] if you can do it, it'd be awesome [13:17:45] and once done https://gerrit.wikimedia.org/r/779852 as well :D [13:17:56] (03PS3) 10Reedy: Enable IP Info instrumentation on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/779579 
(https://phabricator.wikimedia.org/T304438) (owner: 10STran) [13:17:59] but it's just a switch flip, it should be fine [13:17:59] (03CR) 10Reedy: [C: 03+2] Enable IP Info instrumentation on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/779579 (https://phabricator.wikimedia.org/T304438) (owner: 10STran) [13:18:14] Amir1, Reedy: Thanks. I need to attend another training since it's been a while, but they're all outside my hours currently... [13:18:27] Not much has changed... :) [13:18:49] Maybe just my memory/confidence... [13:19:02] (03Merged) 10jenkins-bot: Enable IP Info instrumentation on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/779579 (https://phabricator.wikimedia.org/T304438) (owner: 10STran) [13:19:23] (03PS4) 10Reedy: Migrate $wmfUdp2logDest to $wmgUdp2logDest [mediawiki-config] - 10https://gerrit.wikimedia.org/r/776258 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [13:19:34] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: relforge testing - bking@cumin2002 - T301955 [13:19:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:37] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:19:37] T301955: Upgrade relforge to elasticsearch 6.8.23 - https://phabricator.wikimedia.org/T301955 [13:19:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:40] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:19:41] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:19:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:45] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: 
apply [13:19:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:53] Tchanders: Do you care about testing it on a mwdebug host? Or shall I just sync it out as it's only for testwiki? [13:20:16] Reedy: Would you mind if I test? We managed to break beta with a similar patch in the past [13:20:21] heh [13:20:23] yeah, that's fine [13:20:25] moment [13:21:04] Tchanders: it's on mwdebug1002 [13:21:12] Testing... [13:22:09] question, if I run into something that may be a recent bug from a train deployment, marking it as wm-production-error is enough to flag it, right? [13:22:19] Reedy: Looks good - thank you [13:22:25] 10SRE, 10Infrastructure-Foundations, 10netops: Unify loopback filters between CR routers and L3 switches - https://phabricator.wikimedia.org/T304553 (10cmooney) 05Open→03Resolved [13:22:39] jynus: Not usually AFAIK. You can mark it as a blocker of the deployment task [13:23:07] ok, that is the part I am unsure about- how to know if it is a blocker or a regular bug? [13:23:22] If you're not sure, file it as a blocker. 
It'll guarantee it gets triaged [13:23:26] ok [13:23:31] will do [13:23:34] better safe than sorry etc [13:24:07] !log reedy@deploy1002 Synchronized wmf-config/InitialiseSettings.php: T304438 (duration: 01m 03s) [13:24:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:11] T304438: Enable IP Info instrumentation in testwiki - https://phabricator.wikimedia.org/T304438 [13:24:16] (03CR) 10Reedy: [C: 03+2] Migrate $wmfUdp2logDest to $wmgUdp2logDest [mediawiki-config] - 10https://gerrit.wikimedia.org/r/776258 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [13:24:25] (03CR) 10Btullis: Add an A record for datahub.wikimedia.org (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/779839 (https://phabricator.wikimedia.org/T303049) (owner: 10Btullis) [13:24:37] ah, I think a team filed a duplicate and is aware already, so that's ok [13:24:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:24:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:52] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:24:53] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:24:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:57] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:24:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:00] (03Merged) 10jenkins-bot: Migrate $wmfUdp2logDest to $wmgUdp2logDest [mediawiki-config] - 10https://gerrit.wikimedia.org/r/776258 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [13:25:15] (03PS2) 10Reedy: Set templatelinks migration schema to write both in s4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/779852 
(https://phabricator.wikimedia.org/T299421) (owner: 10Ladsgroup) [13:27:00] !log reedy@deploy1002 Synchronized wmf-config/: Migrate $wmfUdp2logDest to $wmgUdp2logDest - T45956 (duration: 00m 55s) [13:27:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:06] T45956: Rename $wmf* to $wmg* in wmf-config - https://phabricator.wikimedia.org/T45956 [13:27:34] (03CR) 10Ottomata: [C: 03+1] "Okay" [dns] - 10https://gerrit.wikimedia.org/r/779839 (https://phabricator.wikimedia.org/T303049) (owner: 10Btullis) [13:27:41] thanks Reedy [13:28:09] (03CR) 10Reedy: [C: 03+2] Set templatelinks migration schema to write both in s4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/779852 (https://phabricator.wikimedia.org/T299421) (owner: 10Ladsgroup) [13:28:18] (03PS3) 10Zabe: Stop writing to $wmfUdp2logDest [mediawiki-config] - 10https://gerrit.wikimedia.org/r/776259 (https://phabricator.wikimedia.org/T45956) [13:29:02] (03Merged) 10jenkins-bot: Set templatelinks migration schema to write both in s4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/779852 (https://phabricator.wikimedia.org/T299421) (owner: 10Ladsgroup) [13:30:00] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:30:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:03] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:30:04] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:30:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:08] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:30:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:48] !log reedy@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Set 
templatelinks migration schema to write both in s4 - T299421 (duration: 00m 55s) [13:30:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:52] T299421: Turn on write both in production for templatelinks normalization - https://phabricator.wikimedia.org/T299421 [13:31:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P24597 and previous config saved to /var/cache/conftool/dbconfig/20220413-133100-ladsgroup.json [13:31:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:55] Thanks Reedy [13:33:01] (03PS2) 10Zabe: Write the same value to wmgSwiftConfig as to wmfSwiftConfig [mediawiki-config] - 10https://gerrit.wikimedia.org/r/768259 (https://phabricator.wikimedia.org/T45956) [13:33:59] !log installed spicerack v2.4.1 on cumin1001 [13:34:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:03] !log milimetric@deploy1002 Started deploy [analytics/refinery@34be9f3] (thin): Regular analytics weekly train THIN [analytics/refinery@34be9f3] [13:34:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:11] !log milimetric@deploy1002 Finished deploy [analytics/refinery@34be9f3] (thin): Regular analytics weekly train THIN [analytics/refinery@34be9f3] (duration: 00m 07s) [13:34:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:17] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:35:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:20] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:35:22] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:35:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:24] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:26] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:35:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:16] alright, I’m back… anything still needs to be deployed? ^^ [13:38:30] no [13:38:37] alright [13:38:43] then I’ll just wait until the next window starts [13:45:05] (03CR) 10Raymond Ndibe: Create REST api service to manage toolforge replica.my.cnf (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [13:46:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T298565)', diff saved to https://phabricator.wikimedia.org/P24598 and previous config saved to /var/cache/conftool/dbconfig/20220413-134605-ladsgroup.json [13:46:07] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1184.eqiad.wmnet with reason: Maintenance [13:46:09] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1184.eqiad.wmnet with reason: Maintenance [13:46:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:11] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [13:46:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1184 (T298565)', diff saved to https://phabricator.wikimedia.org/P24599 and previous config saved to /var/cache/conftool/dbconfig/20220413-134613-ladsgroup.json [13:46:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:17] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:32] (03PS1) 10Zabe: Migrate $wmfSwiftConfig to $wmgSwiftConfig [mediawiki-config] - 10https://gerrit.wikimedia.org/r/779856 (https://phabricator.wikimedia.org/T45956) [13:58:22] !log restarting bacula hosts [13:58:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:36] (03PS1) 104nn1l2: fawiki: Change logo for 900K milestone [mediawiki-config] - 10https://gerrit.wikimedia.org/r/779858 (https://phabricator.wikimedia.org/T306030) [13:58:37] ^backups will be unavailable for some minutes [13:59:32] (03CR) 10Bking: [C: 03+2] elastic: allow waiting for yellow instead of green [cookbooks] - 10https://gerrit.wikimedia.org/r/778335 (https://phabricator.wikimedia.org/T304570) (owner: 10Ryan Kemper) [14:00:04] Lucas_WMDE and hoo: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Maintenance script run. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220413T1400). 
[14:00:35] o/ [14:00:45] alright, let’s go [14:01:00] (03PS1) 10Zabe: wikitech_private: convert to new array syntax [puppet] - 10https://gerrit.wikimedia.org/r/779860 [14:05:31] !log lucaswerkmeister-wmde@mwmaint1002:~$ mwscript extensions/Wikibase/client/maintenance/populateUnexpectedUnconnectedPagePageProp.php testwiki [14:05:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:23] !log otto@deploy1002 Started deploy [airflow-dags/research@b029f10]: (no justification provided) [14:06:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:27] !log otto@deploy1002 Finished deploy [airflow-dags/research@b029f10]: (no justification provided) (duration: 00m 04s) [14:06:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:15] !log lucaswerkmeister-wmde@mwmaint1002:~$ foreachwikiindblist wikidataclient-test extensions/Wikibase/client/maintenance/populateUnexpectedUnconnectedPagePageProp.php [14:07:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:51] Was something deployed and reverted to whatever cluster ukwiki is in? https://phabricator.wikimedia.org/T306033 [14:08:45] (JobUnavailable) firing: Reduced availability for job bacula in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:09:24] Base: not today as far as I’m aware / can see in the SAL [14:09:41] ukwiki is in group2, so it wouldn’t be affected by the train yet [14:11:38] interesting [14:12:24] (03PS1) 10Lucas Werkmeister (WMDE): Use "unexpectedUnconnectedPage" page prop on wikidataclient-test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/779861 [14:13:18] Base: note that doesn't mean you were wrong- there are many things "on the fly" (browser's cache, cdn's cache). 
Site notice I believe is js heavy, which adds to weirdness [14:14:14] ask if someone from the community sees it wrong now, and if not, you can close the ticket :-) [14:14:43] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Use "unexpectedUnconnectedPage" page prop on wikidataclient-test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/779861 (owner: 10Lucas Werkmeister (WMDE)) [14:14:54] Sitenotice, unlike Centralnotice isn't that JS heavy I think [14:15:37] ah, sorry, I mixed those [14:15:38] (03Merged) 10jenkins-bot: Use "unexpectedUnconnectedPage" page prop on wikidataclient-test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/779861 (owner: 10Lucas Werkmeister (WMDE)) [14:15:49] but still- could be a job that took more than usual, etc. [14:16:18] sitenotice also can take some time until it is updated on all caching layers, but usually not /that/ long [14:17:46] Well it is not a new one too, it was placed on March 6 [14:17:58] Having links render as self-link is a weird thing too [14:18:09] yeah, that I agree [14:18:35] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:779861|Use "unexpectedUnconnectedPage" page prop on wikidataclient-test]] (duration: 00m 55s) [14:18:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:58] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [14:21:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:01] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [14:21:02] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [14:21:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:06] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [14:21:08] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:23] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: relforge testing - bking@cumin2002 - T301955 [14:23:23] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: relforge testing - bking@cumin2002 - T301955 [14:23:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:26] T301955: Upgrade relforge to elasticsearch 6.8.23 - https://phabricator.wikimedia.org/T301955 [14:23:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:47] !log lucaswerkmeister-wmde@mwmaint1002:~$ mwscript extensions/Wikibase/client/maintenance/populateUnexpectedUnconnectedPagePageProp.php commonswiki --last-page-id 10000000 [14:23:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:01] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: relforge testing - bking@cumin2002 - T301955 [14:24:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:06] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: relforge testing - bking@cumin2002 - T301955 [14:27:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:25] lucaswerkmeister-wmde@mwmaint1002:~$ mwscript extensions/Wikibase/client/maintenance/populateUnexpectedUnconnectedPagePageProp.php commonswiki --first-page-id 10000001 --last-page-id 20000000 [14:31:03] oops, forgot the log [14:31:07] well, that’s done now [14:31:09] !log lucaswerkmeister-wmde@mwmaint1002:~$ time mwscript 
extensions/Wikibase/client/maintenance/populateUnexpectedUnconnectedPagePageProp.php commonswiki --first-page-id 10000001 --last-page-id 20000000 [14:31:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:26] !log bacula restarts finished [14:31:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:32] !log lucaswerkmeister-wmde@mwmaint1002:~$ mwscript extensions/Wikibase/client/maintenance/populateUnexpectedUnconnectedPagePageProp.php commonswiki --first-page-id 20000001 --last-page-id 30000000 [14:31:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:54] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [14:33:45] (JobUnavailable) resolved: Reduced availability for job bacula in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:36:07] (03PS1) 10Stang: Optimize logo for Wikispecies [mediawiki-config] - 10https://gerrit.wikimedia.org/r/779865 (https://phabricator.wikimedia.org/T306037) [14:36:15] !log lucaswerkmeister-wmde@mwmaint1002:~$ mwscript extensions/Wikibase/client/maintenance/populateUnexpectedUnconnectedPagePageProp.php commonswiki --batch-size 500 --first-page-id 30000001 --last-page-id 40000000 [14:36:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:53] (03PS1) 10Ottomata: Declare new deployer groups for airflow instances [puppet] - 10https://gerrit.wikimedia.org/r/779887 [14:39:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T298565)', diff saved to 
https://phabricator.wikimedia.org/P24600 and previous config saved to /var/cache/conftool/dbconfig/20220413-143948-ladsgroup.json [14:39:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:53] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [14:40:22] (03CR) 10Bking: [V: 03+2] wdqs: activate jvmquake at 300:5 [puppet] - 10https://gerrit.wikimedia.org/r/779440 (https://phabricator.wikimedia.org/T293862) (owner: 10DCausse) [14:40:30] (03CR) 10jerkins-bot: [V: 04-1] Declare new deployer groups for airflow instances [puppet] - 10https://gerrit.wikimedia.org/r/779887 (owner: 10Ottomata) [14:40:33] (03CR) 10Bking: [V: 03+2 C: 03+2] wdqs: activate jvmquake at 300:5 [puppet] - 10https://gerrit.wikimedia.org/r/779440 (https://phabricator.wikimedia.org/T293862) (owner: 10DCausse) [14:40:50] (03CR) 10JHathaway: mx: use $domain_data rather than $domain for aliases (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/779504 (https://phabricator.wikimedia.org/T305962) (owner: 10JHathaway) [14:41:06] !log lucaswerkmeister-wmde@mwmaint1002:~$ mwscript extensions/Wikibase/client/maintenance/populateUnexpectedUnconnectedPagePageProp.php commonswiki --batch-size 500 --first-page-id 40000001 --last-page-id 50000000 [14:41:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:44] (03CR) 10Andrew Bogott: Create REST api service to manage toolforge replica.my.cnf (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [14:43:32] 10SRE-OnFire, 10Wikidata, 10wdwb-tech, 10Discovery-Search (Current work), and 3 others: Only generate maxlag from pooled query service servers. 
- https://phabricator.wikimedia.org/T238751 (10Addshore) @Joe (Also pinging @akosiaris as I know joe is out right now). It seems like the ideal solution of {T23939... [14:46:21] !log lucaswerkmeister-wmde@mwmaint1002:~$ mwscript extensions/Wikibase/client/maintenance/populateUnexpectedUnconnectedPagePageProp.php commonswiki --batch-size 500 --first-page-id 50000001 --last-page-id 60000000 [14:46:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:15] !log lucaswerkmeister-wmde@mwmaint1002:~$ mwscript extensions/Wikibase/client/maintenance/populateUnexpectedUnconnectedPagePageProp.php commonswiki --batch-size 500 --first-page-id 60000001 --last-page-id 70000000 [14:50:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:14] RECOVERY - SSH on wtp1048.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:54:17] !log lucaswerkmeister-wmde@mwmaint1002:~$ mwscript extensions/Wikibase/client/maintenance/populateUnexpectedUnconnectedPagePageProp.php commonswiki --batch-size 500 --first-page-id 70000001 --last-page-id 80000000 [14:54:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P24601 and previous config saved to /var/cache/conftool/dbconfig/20220413-145453-ladsgroup.json [14:54:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:12] (03PS2) 10Jcrespo: admin: Add Nat to the list of privileged ldap users [puppet] - 10https://gerrit.wikimedia.org/r/779749 (https://phabricator.wikimedia.org/T305978) [14:58:07] (03CR) 10Jcrespo: [C: 03+2] admin: Add Nat to the list of privileged ldap users [puppet] - 10https://gerrit.wikimedia.org/r/779749 (https://phabricator.wikimedia.org/T305978) (owner: 10Jcrespo) [14:58:30] (03CR) 
10Dzahn: [C: 03+1] "checked in LDAP, looks good to me!" [puppet] - 10https://gerrit.wikimedia.org/r/779749 (https://phabricator.wikimedia.org/T305978) (owner: 10Jcrespo) [14:58:41] (03PS2) 10Ottomata: Declare new deployer groups for airflow instances [puppet] - 10https://gerrit.wikimedia.org/r/779887 [14:58:45] !log lucaswerkmeister-wmde@mwmaint1002:~$ mwscript extensions/Wikibase/client/maintenance/populateUnexpectedUnconnectedPagePageProp.php commonswiki --batch-size 500 --first-page-id 80000001 --last-page-id 90000000 [14:58:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:28] 10SRE-OnFire, 10Wikidata, 10wdwb-tech, 10Discovery-Search (Current work), and 3 others: Only generate maxlag from pooled query service servers. - https://phabricator.wikimedia.org/T238751 (10akosiaris) >>! In T238751#7851690, @Addshore wrote: > @Joe (Also pinging @akosiaris as I know joe is out right now).... [14:59:42] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to ldap/wmf for Nathillard - https://phabricator.wikimedia.org/T305978 (10Dzahn) >>! In T305978#7850466, @jcrespo wrote: > @Dzahn I responded before I had the chance to read your comments. I didn't see explicit concerns about me proceeding (just... 
[15:00:16] (03CR) 10jerkins-bot: [V: 04-1] Declare new deployer groups for airflow instances [puppet] - 10https://gerrit.wikimedia.org/r/779887 (owner: 10Ottomata) [15:00:18] (03CR) 10Volans: "reply inline" [software/spicerack] - 10https://gerrit.wikimedia.org/r/775904 (owner: 10Volans) [15:03:08] !log lucaswerkmeister-wmde@mwmaint1002:~$ mwscript extensions/Wikibase/client/maintenance/populateUnexpectedUnconnectedPagePageProp.php commonswiki --batch-size 500 --first-page-id 90000001 --last-page-id 100000000 [15:03:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:12] (03PS3) 10Ottomata: Declare new deployer groups for airflow instances [puppet] - 10https://gerrit.wikimedia.org/r/779887 [15:04:05] (03PS4) 10Ottomata: Declare new deployer groups for airflow instances [puppet] - 10https://gerrit.wikimedia.org/r/779887 [15:05:45] (03CR) 10jerkins-bot: [V: 04-1] Declare new deployer groups for airflow instances [puppet] - 10https://gerrit.wikimedia.org/r/779887 (owner: 10Ottomata) [15:06:45] (03PS5) 10Ottomata: Declare new research-deployers group for airflow instances [puppet] - 10https://gerrit.wikimedia.org/r/779887 [15:07:18] !log lucaswerkmeister-wmde@mwmaint1002:~$ mwscript extensions/Wikibase/client/maintenance/populateUnexpectedUnconnectedPagePageProp.php commonswiki --batch-size 500 --first-page-id 100000001 --last-page-id 110000000 [15:07:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:58] (03CR) 10Ottomata: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34816/console" [puppet] - 10https://gerrit.wikimedia.org/r/779887 (owner: 10Ottomata) [15:08:13] (03CR) 10Vivian Rook: [C: 03+2] add chunkeddriver.py.patch to wallaby [puppet] - 10https://gerrit.wikimedia.org/r/777873 (https://phabricator.wikimedia.org/T304694) (owner: 10Vivian Rook) [15:08:24] (03CR) 10jerkins-bot: [V: 04-1] Declare new research-deployers group for airflow 
instances [puppet] - 10https://gerrit.wikimedia.org/r/779887 (owner: 10Ottomata) [15:08:35] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to ldap/wmf for Nathillard - https://phabricator.wikimedia.org/T305978 (10jcrespo) @NHillard-WMF Access deployed- you can test it works for you on gerrit, or any of the other services granted? https://wikitech.wikimedia.org/wiki/SRE/LDAP/Groups#... [15:09:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P24602 and previous config saved to /var/cache/conftool/dbconfig/20220413-150959-ladsgroup.json [15:10:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:08] !log lucaswerkmeister-wmde@mwmaint1002:~$ mwscript extensions/Wikibase/client/maintenance/populateUnexpectedUnconnectedPagePageProp.php commonswiki --batch-size 500 --first-page-id 110000001 [15:11:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:30] (03PS6) 10Ottomata: Declare new research-deployers group for airflow instances [puppet] - 10https://gerrit.wikimedia.org/r/779887 [15:15:38] (03PS7) 10Ottomata: Declare new research-deployers group for airflow instances [puppet] - 10https://gerrit.wikimedia.org/r/779887 [15:17:17] (03CR) 10Ottomata: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34819/console" [puppet] - 10https://gerrit.wikimedia.org/r/779887 (owner: 10Ottomata) [15:18:43] (03PS8) 10Ottomata: Declare new research-deployers group for airflow instances [puppet] - 10https://gerrit.wikimedia.org/r/779887 [15:21:14] (03CR) 10Ottomata: [C: 03+2] Declare new research-deployers group for airflow instances [puppet] - 10https://gerrit.wikimedia.org/r/779887 (owner: 10Ottomata) [15:23:26] !log otto@deploy1002 Started deploy [airflow-dags/research@b029f10]: (no justification provided) [15:23:29] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:36] !log otto@deploy1002 Finished deploy [airflow-dags/research@b029f10]: (no justification provided) (duration: 00m 10s) [15:23:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T298565)', diff saved to https://phabricator.wikimedia.org/P24603 and previous config saved to /var/cache/conftool/dbconfig/20220413-152504-ladsgroup.json [15:25:05] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance [15:25:07] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance [15:25:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:10] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [15:25:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:12] PROBLEM - Host mw1308 is DOWN: PING CRITICAL - Packet loss = 100% [15:29:58] RECOVERY - Host mw1308 is UP: PING OK - Packet loss = 0%, RTA = 0.63 ms [15:31:33] 10SRE, 10ops-eqiad: mw1308 - internal IPMI error - mgmt / DRAC problem - https://phabricator.wikimedia.org/T305741 (10Cmjohnson) 05Open→03Resolved a:03Cmjohnson Fixed [15:32:22] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host cloudstore1010.wikimedia.org with OS bullseye [15:32:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:29] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 2 
others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudstore1010.wikimedia.or... [15:37:09] !log otto@deploy1002 Started deploy [airflow-dags/research@b029f10]: (no justification provided) [15:37:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:13] !log otto@deploy1002 Finished deploy [airflow-dags/research@b029f10]: (no justification provided) (duration: 00m 03s) [15:37:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:27] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (10Cmjohnson) @volans @Papaul I get this during the install. This requires a manual entry [ (1*installer) 2 shell 3... [15:41:22] 10SRE, 10Data-Catalog, 10Data-Engineering, 10serviceops, and 2 others: New Service Request: DataHub - https://phabricator.wikimedia.org/T303049 (10BTullis) I'm unsure what else I need to do now to make this new service available. I've successfully deployed the service to staging, eqiad and codfw using `he... [15:45:19] !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudstore1010.wikimedia.org with OS bullseye [15:45:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:24] (03PS1) 10Ottomata: Bounce keyholder-proxy when keyholder-auth.d group -> key mapping changes [puppet] - 10https://gerrit.wikimedia.org/r/779897 [15:45:25] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudstore1010.wikimedia.org wi... 
[15:45:57] (03PS1) 10Btullis: Update datahub to use version 0.8.32 [deployment-charts] - 10https://gerrit.wikimedia.org/r/779898 (https://phabricator.wikimedia.org/T306019) [15:46:15] (03PS1) 10Majavah: openstack: make enc-cli authenticate via keystone [puppet] - 10https://gerrit.wikimedia.org/r/779899 (https://phabricator.wikimedia.org/T274666) [15:47:23] (03CR) 10Ottomata: "I'll try to check that this works in a few weeks when I have to add another deployers group for platform eng." [puppet] - 10https://gerrit.wikimedia.org/r/779897 (owner: 10Ottomata) [15:47:51] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on mc2023.codfw.wmnet with reason: moving to a different rack [15:47:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:53] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on mc2023.codfw.wmnet with reason: moving to a different rack [15:47:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:58] PROBLEM - SSH on aqs1009.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:47:59] 10SRE, 10ops-codfw, 10Data-Persistence (Consultation): codfw: Dedicate Rack B1 for cloudX-dev servers - https://phabricator.wikimedia.org/T305469 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=575a5fd0-668b-41f6-8ab3-5ff749f54ac7) set by akosiaris@cumin1001 for 2 days, 0:00:00 on 1 host(... 
[15:48:01] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on kubestage2002.codfw.wmnet with reason: moving to a different rack [15:48:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:03] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on kubestage2002.codfw.wmnet with reason: moving to a different rack [15:48:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:09] 10SRE, 10ops-codfw, 10Data-Persistence (Consultation): codfw: Dedicate Rack B1 for cloudX-dev servers - https://phabricator.wikimedia.org/T305469 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=60f8ccbd-38ba-4b65-aadf-f44a7fc83c9e) set by akosiaris@cumin1001 for 2 days, 0:00:00 on 1 host(... [15:49:15] (03PS1) 10MVernon: swift: correct handling of non-ASCII paths in rewrite.py & test suite [puppet] - 10https://gerrit.wikimedia.org/r/779900 (https://phabricator.wikimedia.org/T305942) [15:49:51] (03CR) 10jerkins-bot: [V: 04-1] swift: correct handling of non-ASCII paths in rewrite.py & test suite [puppet] - 10https://gerrit.wikimedia.org/r/779900 (https://phabricator.wikimedia.org/T305942) (owner: 10MVernon) [15:49:57] 10SRE, 10ops-codfw, 10Data-Persistence (Consultation): codfw: Dedicate Rack B1 for cloudX-dev servers - https://phabricator.wikimedia.org/T305469 (10akosiaris) @Papaul: mc2023 and kubestage2002 have been downtimed again (for 2 days) and I've just powered them off. They should be ready to be moved. 
[15:50:39] (03PS1) 10Zabe: webperf: migrate warm_up_coal_cache cron to systemd timer job [puppet] - 10https://gerrit.wikimedia.org/r/779901 (https://phabricator.wikimedia.org/T273673) [15:50:41] (03PS1) 10Zabe: webperf: remove absented warm_up_coal_cache cron [puppet] - 10https://gerrit.wikimedia.org/r/779902 (https://phabricator.wikimedia.org/T273673) [15:51:03] !log Ran extensions/Wikibase/client/maintenance/populateUnexpectedUnconnectedPagePageProp.php for all of enwiki (for 5M pages each). [15:51:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:14] (03CR) 10jerkins-bot: [V: 04-1] webperf: migrate warm_up_coal_cache cron to systemd timer job [puppet] - 10https://gerrit.wikimedia.org/r/779901 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [15:51:35] (03CR) 10jerkins-bot: [V: 04-1] webperf: remove absented warm_up_coal_cache cron [puppet] - 10https://gerrit.wikimedia.org/r/779902 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [15:52:11] (03PS2) 10Zabe: webperf: migrate warm_up_coal_cache cron to systemd timer job [puppet] - 10https://gerrit.wikimedia.org/r/779901 (https://phabricator.wikimedia.org/T273673) [15:52:58] !log Running extensions/Wikibase/client/maintenance/populateUnexpectedUnconnectedPagePageProp.php for all of wikidatawiki (for 5M pages each). 
[15:53:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:35] (03PS2) 10MVernon: swift: correct handling of non-ASCII paths in rewrite.py & test suite [puppet] - 10https://gerrit.wikimedia.org/r/779900 (https://phabricator.wikimedia.org/T305942) [15:54:45] (JobUnavailable) firing: Reduced availability for job calico-felix in k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:54:58] (KubernetesCalicoDown) firing: (2) kubestage2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [15:55:15] (03CR) 10jerkins-bot: [V: 04-1] swift: correct handling of non-ASCII paths in rewrite.py & test suite [puppet] - 10https://gerrit.wikimedia.org/r/779900 (https://phabricator.wikimedia.org/T305942) (owner: 10MVernon) [15:57:25] (03PS2) 10Zabe: webperf: remove absented warm_up_coal_cache cron [puppet] - 10https://gerrit.wikimedia.org/r/779902 (https://phabricator.wikimedia.org/T273673) [15:57:26] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (10Cmjohnson) a:05Cmjohnson→03nskaggs These servers fail partman, it appears that the installer is looking for an answer that i... 
[15:57:42] (03PS3) 10MVernon: swift: correct handling of non-ASCII paths in rewrite.py & test suite [puppet] - 10https://gerrit.wikimedia.org/r/779900 (https://phabricator.wikimedia.org/T305942) [15:59:23] (03CR) 10jerkins-bot: [V: 04-1] swift: correct handling of non-ASCII paths in rewrite.py & test suite [puppet] - 10https://gerrit.wikimedia.org/r/779900 (https://phabricator.wikimedia.org/T305942) (owner: 10MVernon) [15:59:58] (KubernetesCalicoDown) firing: (2) kubestage2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [16:01:28] !log Running extensions/Wikibase/client/maintenance/populateUnexpectedUnconnectedPagePageProp.php for all of frwiki (for 5M pages each). [16:01:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:36] (03PS4) 10MVernon: swift: correct handling of non-ASCII paths in rewrite.py & test suite [puppet] - 10https://gerrit.wikimedia.org/r/779900 (https://phabricator.wikimedia.org/T305942) [16:02:45] !log dzahn@cumin2002 conftool action : set/pooled=yes; selector: dc=eqiad,name=mw1308.eqiad.wmnet [16:02:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:52] 10SRE, 10ops-eqiad: mw1308 - internal IPMI error - mgmt / DRAC problem - https://phabricator.wikimedia.org/T305741 (10Dzahn) Thank you, Chris. - server repooled [16:04:28] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM! Also good syntax examples and I learnt what typing stubs were, so thanks :)" [software/homer] - 10https://gerrit.wikimedia.org/r/779849 (owner: 10Volans) [16:04:35] !log Running extensions/Wikibase/client/maintenance/populateUnexpectedUnconnectedPagePageProp.php for all of jawiki (for 5M pages each). 
[16:04:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:49] (03PS1) 10Zabe: Revert "Start writing to cuc_actor in guwwiki and shnwikivoyage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/779906 (https://phabricator.wikimedia.org/T306045) [16:05:20] (03CR) 10Btullis: "Adding Arzhel for the traffic perspective." [puppet] - 10https://gerrit.wikimedia.org/r/779840 (https://phabricator.wikimedia.org/T303049) (owner: 10Btullis) [16:06:54] !log Running extensions/Wikibase/client/maintenance/populateUnexpectedUnconnectedPagePageProp.php for all of ruwiki (for 5M pages each). [16:06:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:56] (03CR) 10RhinosF1: [C: 03+1] Revert "Start writing to cuc_actor in guwwiki and shnwikivoyage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/779906 (https://phabricator.wikimedia.org/T306045) (owner: 10Zabe) [16:07:05] (03CR) 10Reedy: [C: 03+2] Revert "Start writing to cuc_actor in guwwiki and shnwikivoyage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/779906 (https://phabricator.wikimedia.org/T306045) (owner: 10Zabe) [16:07:41] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM!" [software/homer] - 10https://gerrit.wikimedia.org/r/779850 (owner: 10Volans) [16:08:17] (03Merged) 10jenkins-bot: Revert "Start writing to cuc_actor in guwwiki and shnwikivoyage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/779906 (https://phabricator.wikimedia.org/T306045) (owner: 10Zabe) [16:09:35] Thanks Reedy [16:09:37] (03CR) 10Cathal Mooney: [C: 03+2] Remove config/var for defining bespoke interfaces for IPv6 RAs [homer/public] - 10https://gerrit.wikimedia.org/r/779844 (https://phabricator.wikimedia.org/T299758) (owner: 10Cathal Mooney) [16:11:02] (03CR) 10Btullis: "Adding Arzhel for the traffic perspective." 
[dns] - 10https://gerrit.wikimedia.org/r/779839 (https://phabricator.wikimedia.org/T303049) (owner: 10Btullis) [16:11:11] (03Merged) 10jenkins-bot: Remove config/var for defining bespoke interfaces for IPv6 RAs [homer/public] - 10https://gerrit.wikimedia.org/r/779844 (https://phabricator.wikimedia.org/T299758) (owner: 10Cathal Mooney) [16:12:37] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [16:12:38] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1164.eqiad.wmnet with reason: Maintenance [16:12:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:40] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1164.eqiad.wmnet with reason: Maintenance [16:12:41] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [16:12:42] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [16:12:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1164 (T298565)', diff saved to https://phabricator.wikimedia.org/P24604 and previous config saved to /var/cache/conftool/dbconfig/20220413-161245-ladsgroup.json [16:12:46] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [16:12:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:52] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, 
user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [16:12:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:18] !log reedy@deploy1002 Synchronized wmf-config/InitialiseSettings.php: T306045 (duration: 00m 55s) [16:13:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:22] T306045: Column 'cuc_actor' cannot be null (localhost) when logging in with incorrect creds - https://phabricator.wikimedia.org/T306045 [16:13:36] !log Running extensions/Wikibase/client/maintenance/populateUnexpectedUnconnectedPagePageProp.php for all of cebwiki (for 5M pages each). [16:13:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:22] !log Running extensions/Wikibase/client/maintenance/populateUnexpectedUnconnectedPagePageProp.php for all of viwiki (for 5M pages each). [16:18:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:29] (03CR) 10Eevans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/779846 (owner: 10Hnowlan) [16:20:48] !log Running extensions/Wikibase/client/maintenance/populateUnexpectedUnconnectedPagePageProp.php for all of metawiki (for 5M pages each). [16:20:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:36] huh, I wouldn’t have thought that metawiki even has 5M pages [16:22:23] (03CR) 10RLazarus: [C: 03+1] mediawiki: call siteinfo in HTTPS [software/spicerack] - 10https://gerrit.wikimedia.org/r/779841 (owner: 10Volans) [16:22:47] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Jim Maddock - https://phabricator.wikimedia.org/T249873 (10jmads) 05Resolved→03Open re-opening this ticket to restore access to analytics-privatedata-users ldap group. 
[16:24:17] Lucas_WMDE: It has over 8M user talk pages :O [16:26:35] !log Running extensions/Wikibase/client/maintenance/populateUnexpectedUnconnectedPagePageProp.php for all of ruwikinews (for 5M pages each). [16:26:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:40] ah :D [16:39:26] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: check_netbox_uncommitted_dns_changes.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:39:48] !log razzi@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-launcher1002.eqiad.wmnet [16:39:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:50] (03CR) 10Dave Pifke: [C: 03+1] "LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/779901 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [16:40:20] !log reboot an-launcher1002 for security updates [16:40:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:23] (03CR) 10Volans: [C: 03+2] setup.py: add missing types for requests [software/homer] - 10https://gerrit.wikimedia.org/r/779849 (owner: 10Volans) [16:41:38] (03CR) 10Volans: [C: 03+2] capirca: catch also requests exceptions [software/homer] - 10https://gerrit.wikimedia.org/r/779850 (owner: 10Volans) [16:41:42] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:41:47] (03CR) 10Volans: [C: 03+2] mediawiki: call siteinfo in HTTPS [software/spicerack] - 10https://gerrit.wikimedia.org/r/779841 (owner: 10Volans) [16:41:53] any idea why I might be getting a base@gerrit.wikimedia.org: Permission denied (publickey). when attempting to clone or pull? I do have my ssh key added to gerrit. Might be something on my side, since I recently had a system upgrade, but gitlab.com clone works fine. 
[16:42:39] correct key loaded into the agent? [16:42:44] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: An error occurred checking if Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [16:43:03] Reedy: well, I only have one [16:44:22] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:45:34] 10SRE, 10SRE-Access-Requests, 10Infrastructure-Foundations, 10serviceops, 10Release-Engineering-Team (Radar): Need a service account on deploy servers - https://phabricator.wikimedia.org/T303857 (10thcipriani) [16:45:41] (03Merged) 10jenkins-bot: setup.py: add missing types for requests [software/homer] - 10https://gerrit.wikimedia.org/r/779849 (owner: 10Volans) [16:45:43] (03Merged) 10jenkins-bot: capirca: catch also requests exceptions [software/homer] - 10https://gerrit.wikimedia.org/r/779850 (owner: 10Volans) [16:48:03] !log razzi@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-launcher1002.eqiad.wmnet [16:48:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:14] (03Merged) 10jenkins-bot: mediawiki: call siteinfo in HTTPS [software/spicerack] - 10https://gerrit.wikimedia.org/r/779841 (owner: 10Volans) [16:50:28] !log Running extensions/Wikibase/client/maintenance/populateUnexpectedUnconnectedPagePageProp.php for all Wikidata clients of s2 (with --batch-size 250). 
[16:50:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:59:45] (03PS3) 10Hnowlan: changeprop: add sampling configuration, set num_workers [deployment-charts] - 10https://gerrit.wikimedia.org/r/767080 (https://phabricator.wikimedia.org/T300914) [17:03:54] !log Running extensions/Wikibase/client/maintenance/populateUnexpectedUnconnectedPagePageProp.php for all remaining Wikidata clients of s3 (with --batch-size 250). [17:03:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:59] !log Running extensions/Wikibase/client/maintenance/populateUnexpectedUnconnectedPagePageProp.php for all remaining Wikidata clients of s5 (with --batch-size 250). [17:07:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:08:05] (03CR) 10Dzahn: "thanks, i'll do it soon" [puppet] - 10https://gerrit.wikimedia.org/r/779901 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [17:09:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1164 (T298565)', diff saved to https://phabricator.wikimedia.org/P24605 and previous config saved to /var/cache/conftool/dbconfig/20220413-170907-ladsgroup.json [17:09:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:12] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [17:09:44] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [17:09:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:10:07] (03PS1) 10Btullis: Use both dbproxy101[89] servers for both wikireplica services [puppet] - 10https://gerrit.wikimedia.org/r/779915 (https://phabricator.wikimedia.org/T298940) [17:10:48] !log Running 
extensions/Wikibase/client/maintenance/populateUnexpectedUnconnectedPagePageProp.php for all remaining Wikidata clients of s7 (with --batch-size 250). [17:10:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:35] (03PS2) 10Btullis: Use both dbproxy101[89] servers for both wikireplica services [puppet] - 10https://gerrit.wikimedia.org/r/779915 (https://phabricator.wikimedia.org/T298940) [17:12:41] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/779915 (https://phabricator.wikimedia.org/T298940) (owner: 10Btullis) [17:12:48] (03CR) 10Volans: [C: 03+1] "The change looks sane, I don't have too much context to foresee all the possible implications, but I can't see anything wrong with it." [puppet] - 10https://gerrit.wikimedia.org/r/779900 (https://phabricator.wikimedia.org/T305942) (owner: 10MVernon) [17:13:52] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:13:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:57] 10SRE, 10LDAP-Access-Requests: Grant Access to nda for jmads - https://phabricator.wikimedia.org/T306117 (10jmads) [17:16:41] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (NOOP 2 DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34820/console" [puppet] - 10https://gerrit.wikimedia.org/r/779915 (https://phabricator.wikimedia.org/T298940) (owner: 10Btullis) [17:24:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1164', diff saved to https://phabricator.wikimedia.org/P24606 and previous config saved to /var/cache/conftool/dbconfig/20220413-172412-ladsgroup.json [17:24:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:28:02] 10SRE, 10MediaWiki-General, 10Performance-Team, 10serviceops-radar, and 5 others: Move MainStash out of Redis to a simpler multi-dc aware solution - https://phabricator.wikimedia.org/T212129 (10Krinkle) 
[17:28:17] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [17:28:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:54] 10SRE, 10ops-eqiad, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install ganeti10[29|3(012)] - https://phabricator.wikimedia.org/T299459 (10Cmjohnson) [17:35:51] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:35:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:39:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1164', diff saved to https://phabricator.wikimedia.org/P24607 and previous config saved to /var/cache/conftool/dbconfig/20220413-173917-ladsgroup.json [17:39:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:39:39] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [17:39:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:46] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission kubernetes100[1-4] - https://phabricator.wikimedia.org/T303044 (10Cmjohnson) [17:42:05] (03PS2) 10Raymond Ndibe: Create REST api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) [17:43:34] (03CR) 10jerkins-bot: [V: 04-1] Create REST api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [17:44:51] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission kubernetes100[1-4] - https://phabricator.wikimedia.org/T303044 (10Cmjohnson) 05Open→03Resolved [17:44:58] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:45:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:45:42] (03CR) 10Raymond Ndibe: Create REST api service to manage toolforge replica.my.cnf (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [17:46:45] (03CR) 10Raymond Ndibe: "the test is failing because the test tool doesn't recognize certain type hints. Wondering if we should remove those?" [puppet] - 10https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [17:48:43] 10ops-eqiad: Port with no description on access switch - https://phabricator.wikimedia.org/T304849 (10Cmjohnson) 05Open→03Resolved a:03Cmjohnson this is cloudstore1011, netbox is updated now [17:50:40] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudcephosd10[25-34] - https://phabricator.wikimedia.org/T294972 (10Cmjohnson) This is blocked until vlans for these switches are ready [17:51:04] RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [17:51:44] 10ops-eqiad, 10DC-Ops, 10serviceops: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - https://phabricator.wikimedia.org/T306121 (10RobH) [17:51:58] 10ops-eqiad, 10DC-Ops, 10serviceops: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - https://phabricator.wikimedia.org/T306121 (10RobH) [17:52:34] 10ops-eqiad, 10DC-Ops, 10serviceops: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - https://phabricator.wikimedia.org/T306121 (10RobH) [17:54:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1164 (T298565)', diff saved to https://phabricator.wikimedia.org/P24608 and previous config saved to /var/cache/conftool/dbconfig/20220413-175422-ladsgroup.json [17:54:24] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1169.eqiad.wmnet with reason: Maintenance [17:54:25] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1169.eqiad.wmnet with 
reason: Maintenance [17:54:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:27] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [17:54:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1169 (T298565)', diff saved to https://phabricator.wikimedia.org/P24609 and previous config saved to /var/cache/conftool/dbconfig/20220413-175430-ladsgroup.json [17:54:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:31] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Jim Maddock - https://phabricator.wikimedia.org/T249873 (10RhinosF1) a:05fgiunchedi→03None [17:54:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:56] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Jim Maddock - https://phabricator.wikimedia.org/T249873 (10RhinosF1) Moving back to SRE queue [17:55:03] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Jim Maddock - https://phabricator.wikimedia.org/T249873 (10RhinosF1) >>! In T249873#7852201, @jmads wrote: > re-opening this ticket to restore access to analytics-privatedata-users ldap group. Is everything above still the same?... 
[17:57:11] 10SRE, 10LDAP-Access-Requests: Grant Access to nda for jmads - https://phabricator.wikimedia.org/T306117 (10RhinosF1) See also T249873 [17:57:37] (03PS1) 10Razzi: wikireplicas: depool clouddb1015-16 [puppet] - 10https://gerrit.wikimedia.org/r/779918 (https://phabricator.wikimedia.org/T299480) [18:00:04] dancy and jnuche: Time to snap out of that daydream and deploy Train log triage with CPT. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220413T1800). [18:00:04] dancy and jnuche: I, the Bot under the Fountain, call upon thee, The Deployer, to do MediaWiki train - Utc-7+Utc-0 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220413T1800). [18:01:18] (03CR) 10Razzi: [C: 03+2] wikireplicas: depool clouddb1015-16 [puppet] - 10https://gerrit.wikimedia.org/r/779918 (https://phabricator.wikimedia.org/T299480) (owner: 10Razzi) [18:03:58] o/ [18:06:52] Train log triage will be tomorrow. [18:06:57] Rolling forward to group1. [18:07:20] (03PS1) 10Razzi: wikireplicas: fix depooling yaml [puppet] - 10https://gerrit.wikimedia.org/r/779919 (https://phabricator.wikimedia.org/T299480) [18:07:31] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Jim Maddock - https://phabricator.wikimedia.org/T249873 (10jmads) All info is still the same. Thanks! 
[18:09:54] (03CR) 10Razzi: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34822/console" [puppet] - 10https://gerrit.wikimedia.org/r/779919 (https://phabricator.wikimedia.org/T299480) (owner: 10Razzi) [18:10:45] (03CR) 10Razzi: [V: 03+1 C: 03+2] wikireplicas: fix depooling yaml [puppet] - 10https://gerrit.wikimedia.org/r/779919 (https://phabricator.wikimedia.org/T299480) (owner: 10Razzi) [18:15:36] (03PS1) 10Ahmon Dancy: group1 wikis to 1.39.0-wmf.7 refs T305213 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/779921 [18:15:38] (03CR) 10Ahmon Dancy: [C: 03+2] group1 wikis to 1.39.0-wmf.7 refs T305213 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/779921 (owner: 10Ahmon Dancy) [18:16:36] (03Merged) 10jenkins-bot: group1 wikis to 1.39.0-wmf.7 refs T305213 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/779921 (owner: 10Ahmon Dancy) [18:17:23] !log dancy@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.39.0-wmf.7 refs T305213 [18:17:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:27] T305213: 1.39.0-wmf.7 deployment blockers - https://phabricator.wikimedia.org/T305213 [18:17:35] I'm going to re-run that. 
[18:19:09] (03CR) 10Jdlrobson: Enable Table of Contents AB test on Beta (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/779551 (https://phabricator.wikimedia.org/T302046) (owner: 10Nray) [18:19:10] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [18:19:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:14] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [18:19:15] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [18:19:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:18] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [18:19:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:47] !log dancy@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.39.0-wmf.7 refs T305213 [18:20:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:28] (03PS1) 10Zabe: Revert "Revert "Start writing to cuc_actor in guwwiki and shnwikivoyage"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/779922 (https://phabricator.wikimedia.org/T233004) [18:21:44] !log dancy@deploy1002 Synchronized php: group1 wikis to 1.39.0-wmf.7 refs T305213 (duration: 00m 56s) [18:21:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:49] (03CR) 10Zabe: [C: 04-1] "Needs https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CheckUser/+/779912/ to safely be deployed on those wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/779922 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [18:24:26] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [18:24:28] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:29] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [18:24:30] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [18:24:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:34] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [18:24:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:47] (03PS1) 10Nray: Correct AB test config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/779923 (https://phabricator.wikimedia.org/T302046) [18:25:40] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:26:00] (03CR) 10Nray: Enable Table of Contents AB test on Beta (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/779551 (https://phabricator.wikimedia.org/T302046) (owner: 10Nray) [18:26:02] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:27:43] (03CR) 10Raymond Ndibe: Create REST api service to manage toolforge replica.my.cnf (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [18:27:46] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.325 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:28:08] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 47966 bytes in 0.118 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:29:08] 10SRE, 10ops-codfw, 10DC-Ops, 10cloud-services-team (Hardware): Q3:(Need By: TBD) rack/setup/install 7 
wmcs hosts - https://phabricator.wikimedia.org/T304881 (10nskaggs) @Papaul By default for HA purposes, we include language to spread servers out when needed. However, given these machines are in dev, and... [18:32:54] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [18:33:58] !log razzi@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on clouddb1015.eqiad.wmnet with reason: Upgrade to bullseye [18:34:01] !log razzi@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1015.eqiad.wmnet with reason: Upgrade to bullseye [18:34:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:34:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:36:40] !log razzi@cumin1001 START - Cookbook sre.hosts.reimage for host clouddb1015.eqiad.wmnet with OS bullseye [18:36:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:40] PROBLEM - haproxy failover on dbproxy1018 is CRITICAL: CRITICAL check_failover servers up 14 down 2: https://wikitech.wikimedia.org/wiki/HAProxy [18:40:33] razzi: is that expected ^ [18:42:33] (03CR) 10Herron: [C: 03+1] mx: use $domain_data rather than $domain for aliases (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/779504 (https://phabricator.wikimedia.org/T305962) (owner: 10JHathaway) [18:44:38] 10SRE-Access-Requests, 10Data-Engineering, 10Generated Data Platform: Request to grant cparle and mfossati login to an-airflow1003.eqiad.wmne - https://phabricator.wikimedia.org/T306057 (10Ottomata) [18:46:07] 10SRE-Access-Requests, 10Data-Engineering, 10Generated Data Platform: Request to grant cparle and mfossati login to an-airflow1003.eqiad.wmne - 
https://phabricator.wikimedia.org/T306057 (10Ottomata) Tagging SRE-Access-Requests for help in figuring out how best to fulfill this request. Cormac and Marco are o... [18:47:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T298565)', diff saved to https://phabricator.wikimedia.org/P24610 and previous config saved to /var/cache/conftool/dbconfig/20220413-184721-ladsgroup.json [18:47:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:26] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [18:48:06] !log razzi@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on clouddb1015.eqiad.wmnet with reason: host reimage [18:48:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:01] !log razzi@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on clouddb1015.eqiad.wmnet with reason: host reimage [18:51:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:52:02] (03CR) 10JHathaway: [C: 03+2] mx: use $domain_data rather than $domain for aliases [puppet] - 10https://gerrit.wikimedia.org/r/779504 (https://phabricator.wikimedia.org/T305962) (owner: 10JHathaway) [18:53:20] RECOVERY - SSH on aqs1009.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:55:08] (03PS1) 10Razzi: dbproxy: add clouddb sections to conftool [puppet] - 10https://gerrit.wikimedia.org/r/779926 (https://phabricator.wikimedia.org/T304478) [18:55:57] (03CR) 10Ebernhardson: [C: 03+2] team-search-platform: remove BlazegraphJvmQuakeWarnGC [alerts] - 10https://gerrit.wikimedia.org/r/779831 
(https://phabricator.wikimedia.org/T293862) (owner: 10DCausse) [18:56:35] (03CR) 10Razzi: [V: 03+1] "PCC SUCCESS (DIFF 2 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34823/console" [puppet] - 10https://gerrit.wikimedia.org/r/779926 (https://phabricator.wikimedia.org/T304478) (owner: 10Razzi) [18:58:54] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10tox-wikimedia, and 2 others: Introduce Python code formatters usage - https://phabricator.wikimedia.org/T211750 (10jhathaway) Our we ready to consider running black on our puppet repo? [19:00:12] (03Merged) 10jenkins-bot: team-search-platform: remove BlazegraphJvmQuakeWarnGC [alerts] - 10https://gerrit.wikimedia.org/r/779831 (https://phabricator.wikimedia.org/T293862) (owner: 10DCausse) [19:00:29] 10SRE, 10Infrastructure-Foundations, 10Mail, 10Patch-For-Review: Exim emitting warnings about tainted filenames - https://phabricator.wikimedia.org/T305962 (10jhathaway) 05Open→03Resolved a:03jhathaway merged! 
[19:01:39] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [19:02:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P24611 and previous config saved to /var/cache/conftool/dbconfig/20220413-190226-ladsgroup.json [19:02:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:04:15] (03CR) 10Majavah: [C: 04-1] dbproxy: add clouddb sections to conftool (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/779926 (https://phabricator.wikimedia.org/T304478) (owner: 10Razzi) [19:04:16] RECOVERY - haproxy failover on dbproxy1018 is OK: OK check_failover servers up 16 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [19:06:40] (03PS1) 10Andrew Bogott: OpenStack nova: change log level to 'debug' [puppet] - 10https://gerrit.wikimedia.org/r/779927 [19:08:04] (03CR) 10Andrew Bogott: [C: 03+2] OpenStack nova: change log level to 'debug' [puppet] - 10https://gerrit.wikimedia.org/r/779927 (owner: 10Andrew Bogott) [19:09:36] PROBLEM - SSH on wtp1045.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:15:13] !log razzi@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host clouddb1015.eqiad.wmnet with OS bullseye [19:15:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P24612 and previous config saved to /var/cache/conftool/dbconfig/20220413-191731-ladsgroup.json [19:17:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[19:24:05] (03CR) 10Clare Ming: [C: 03+2] Correct AB test config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/779923 (https://phabricator.wikimedia.org/T302046) (owner: 10Nray) [19:25:35] (03CR) 10Clare Ming: "whoops - sorry - got trigger happy before realizing it was config -- happy to deploy at next window which is in 30 mins" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/779923 (https://phabricator.wikimedia.org/T302046) (owner: 10Nray) [19:27:57] (03CR) 10Nray: Correct AB test config (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/779923 (https://phabricator.wikimedia.org/T302046) (owner: 10Nray) [19:32:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T298565)', diff saved to https://phabricator.wikimedia.org/P24613 and previous config saved to /var/cache/conftool/dbconfig/20220413-193236-ladsgroup.json [19:32:39] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1106.eqiad.wmnet with reason: Maintenance [19:32:40] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1106.eqiad.wmnet with reason: Maintenance [19:32:42] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [19:32:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:32:44] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [19:32:45] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [19:32:45] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:32:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:32:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:32:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1106 (T298565)', diff saved to https://phabricator.wikimedia.org/P24614 and previous config saved to /var/cache/conftool/dbconfig/20220413-193250-ladsgroup.json [19:32:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:32:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:37:02] (03CR) 10Jdlrobson: [C: 03+1] Correct AB test config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/779923 (https://phabricator.wikimedia.org/T302046) (owner: 10Nray) [19:39:28] (03CR) 10Clare Ming: [C: 03+2] Correct AB test config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/779923 (https://phabricator.wikimedia.org/T302046) (owner: 10Nray) [19:40:38] (03Merged) 10jenkins-bot: Correct AB test config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/779923 (https://phabricator.wikimedia.org/T302046) (owner: 10Nray) [19:42:50] (03PS7) 10Krinkle: List Kartographer static map exemptions and document+flip default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773883 (https://phabricator.wikimedia.org/T291736) [19:45:25] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [19:45:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:45:28] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [19:45:29] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [19:45:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:45:33] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [19:45:33] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:45:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:48:17] (03PS3) 10Phedenskog: grafana: provision JSON datasource [puppet] - 10https://gerrit.wikimedia.org/r/774380 (https://phabricator.wikimedia.org/T304583) [19:49:50] 10ops-eqiad: Port with no description on access switch - https://phabricator.wikimedia.org/T306129 (10phaultfinder) [19:55:00] (JobUnavailable) firing: Reduced availability for job calico-felix in k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:58:28] (03PS1) 10Ssingh: dnsrecursor: refactor module (see detailed commit message) [puppet] - 10https://gerrit.wikimedia.org/r/779936 [20:00:04] RoanKattouw, Urbanecm, and cjming: (Dis)respected human, time to deploy UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220413T2000). Please do the needful. [20:00:04] JSherman, koi, and nn1l2: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:09] hi [20:00:13] (KubernetesCalicoDown) firing: kubestage2002.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [20:00:17] o/ [20:00:42] hey [20:00:44] i can deploy today [20:00:59] ty! [20:01:16] Hello, I'm here! 
[20:01:51] hello JSherman and cjming [20:02:18] (03PS2) 10Urbanecm: Update enwiki surveys on beta for QA [mediawiki-config] - 10https://gerrit.wikimedia.org/r/779499 (https://phabricator.wikimedia.org/T294363) (owner: 10Jsn.sherman) [20:02:23] (03CR) 10Urbanecm: [C: 03+2] Update enwiki surveys on beta for QA [mediawiki-config] - 10https://gerrit.wikimedia.org/r/779499 (https://phabricator.wikimedia.org/T294363) (owner: 10Jsn.sherman) [20:03:03] JSherman: your patch should be auto-deployed within ~30 minutes [20:03:24] Thanks! I'll keep an eye out, urbanecm. [20:03:24] (03Merged) 10jenkins-bot: Update enwiki surveys on beta for QA [mediawiki-config] - 10https://gerrit.wikimedia.org/r/779499 (https://phabricator.wikimedia.org/T294363) (owner: 10Jsn.sherman) [20:03:31] (03PS2) 10Urbanecm: Optimize logo for Wikispecies [mediawiki-config] - 10https://gerrit.wikimedia.org/r/779865 (https://phabricator.wikimedia.org/T306037) (owner: 10Stang) [20:03:46] (03CR) 10Urbanecm: [C: 03+2] Optimize logo for Wikispecies [mediawiki-config] - 10https://gerrit.wikimedia.org/r/779865 (https://phabricator.wikimedia.org/T306037) (owner: 10Stang) [20:04:07] koi: your patch is up next :). will let you know once it can be tested. [20:04:22] got it, thanks [20:04:45] (03Merged) 10jenkins-bot: Optimize logo for Wikispecies [mediawiki-config] - 10https://gerrit.wikimedia.org/r/779865 (https://phabricator.wikimedia.org/T306037) (owner: 10Stang) [20:05:15] koi: your patch is at mwdebug1001 [20:05:18] can you have a look? [20:05:23] sure [20:05:26] (03PS2) 10Urbanecm: fawiki: Change logo for 900K milestone [mediawiki-config] - 10https://gerrit.wikimedia.org/r/779858 (https://phabricator.wikimedia.org/T306030) (owner: 104nn1l2) [20:05:40] urbanecm: out of curiosity, is there more to do beyond the steps in the deployment commands https://deploy-commands.toolforge.org/bacc/779865 for files/images? i.e. purge caches for said files? [20:05:55] urbanecm, lgtm [20:06:06] cjming: yes. 
you need to run `purgeList.php` (it accepts a list of URIs on stdin) [20:06:22] note that the canonical domain for /static is en.wikipedia.org [20:06:47] cool - gtk [20:06:58] so you'd run something like `echo 'https://en.wikipedia.org/static/images/project-logos/cswiki.png' | mwscript purgeList.php` for each static resource that was changed [20:07:07] koi: thanks, syncing [20:08:15] (03PS2) 10Ssingh: dnsrecursor: refactor module (see detailed commit message) [puppet] - 10https://gerrit.wikimedia.org/r/779936 [20:08:41] (03CR) 10Urbanecm: [C: 03+2] fawiki: Change logo for 900K milestone [mediawiki-config] - 10https://gerrit.wikimedia.org/r/779858 (https://phabricator.wikimedia.org/T306030) (owner: 104nn1l2) [20:09:13] !log urbanecm@deploy1002 Synchronized static/images/project-logos/: 076e6ef: Optimize logo for Wikispecies (T306037; 1/2) (duration: 00m 55s) [20:09:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:09:18] T306037: Optimize logo for Wikispecies - https://phabricator.wikimedia.org/T306037 [20:09:54] (03Merged) 10jenkins-bot: fawiki: Change logo for 900K milestone [mediawiki-config] - 10https://gerrit.wikimedia.org/r/779858 (https://phabricator.wikimedia.org/T306030) (owner: 104nn1l2) [20:10:07] !log urbanecm@deploy1002 Synchronized wmf-config/logos.php: 076e6ef: Optimize logo for Wikispecies (T306037; 2/2) (duration: 00m 53s) [20:10:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:40] nn1l2: your patch is at mwdebug1001 [20:10:42] can you have a look? 
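Editorial note on the purge step discussed at 20:06: the chat describes feeding one URL per changed static file to `purgeList.php` on stdin, using en.wikipedia.org as the canonical domain for /static. A minimal sketch of that idea follows. The helper name `static_purge_urls` and the variable `CANONICAL_STATIC_HOST` are hypothetical and do not come from the log; only the `mwscript purgeList.php` pipe and the example paths appear in the conversation, and the sketch prints URLs rather than purging anything.

```shell
#!/bin/sh
# Hypothetical helper (not from the log): turn repo-relative paths under
# static/ into purge URLs on the canonical /static domain named in the chat.
CANONICAL_STATIC_HOST="https://en.wikipedia.org"

static_purge_urls() {
    for path in "$@"; do
        # Drop one leading slash so "static/x" and "/static/x" both work.
        printf '%s/%s\n' "$CANONICAL_STATIC_HOST" "${path#/}"
    done
}

# Dry run: print the URLs that would be handed to the purge script.
static_purge_urls \
    static/images/project-logos/cswiki.png \
    static/images/project-logos/specieswiki-2x.png

# On a deployment host, the same output would be piped to the script
# mentioned in the chat, for example:
#   static_purge_urls static/images/project-logos/specieswiki-2x.png | mwscript purgeList.php
```

Generating the list first and piping it in one go keeps the purge to a single invocation, and the dry run makes it easy to confirm the URLs before running it. As the later messages suggest, purging before the sync has finished may leave the old file cached, so the purge presumably belongs after the sync completes.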
[20:10:45] ok [20:10:47] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:10:50] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:10:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:51] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:10:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:55] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:10:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:11:05] (03PS3) 10Ssingh: dnsrecursor: refactor module (see detailed commit message) [puppet] - 10https://gerrit.wikimedia.org/r/779936 [20:11:11] LGTM [20:11:18] syncing [20:12:57] !log urbanecm@deploy1002 Synchronized static/images/mobile/copyright/wikipedia-fa-900K.svg: dfe0b9c: fawiki: Change logo for 900K milestone (T306030; 1/2) (duration: 00m 56s) [20:13:00] (03PS2) 10Razzi: dbproxy: add clouddb sections to conftool [puppet] - 10https://gerrit.wikimedia.org/r/779926 (https://phabricator.wikimedia.org/T304478) [20:13:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:13:02] T306030: Change the logo of Farsi Wikipedia for 900K milestone - https://phabricator.wikimedia.org/T306030 [20:13:15] 10SRE, 10MediaWiki-REST-API, 10Traffic-Icebox: Route requests to the REST MediaWiki API to the api cluster - https://phabricator.wikimedia.org/T263729 (10BBlack) [20:13:52] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: dfe0b9c: fawiki: Change logo for 900K milestone (T306030; 2/2) (duration: 00m 54s) [20:13:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:13:56] 10SRE, 10Traffic, 10serviceops, 10Platform Team Workboards (Green): MW REST 
API should be routed to api_appserver MW cluster - https://phabricator.wikimedia.org/T268043 (10BBlack) [20:14:20] nn1l2: should be all done [20:14:22] anything else, anyone? [20:14:42] thanks [20:15:08] np [20:15:10] urbanecm, the logo is still the previous version https://species.wikimedia.org/static/images/project-logos/specieswiki-2x.png [20:15:22] is the syncing completed? [20:15:27] it should be [20:15:29] but let me double check [20:15:57] koi: i purged it again, and now it seems to work [20:15:58] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:15:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:16:01] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:16:01] perhaps i purged a bit early [20:16:02] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:16:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:16:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:16:06] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:16:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:16:36] hmm, still not working in my place 0 0 [20:17:38] !log razzi@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Upgrade to bullseye [20:17:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:40] !log razzi@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Upgrade to bullseye [20:17:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:18:28] koi: did you try to purge your client side cache? 
[20:18:34] (ctrl+shift+r should do the trick) [20:18:52] yeah, I even tried another browser [20:19:09] koi: did you try accessing https://species.wikimedia.org/static/images/project-logos/specieswiki-2x.png directly? [20:19:35] yes, it is still the old version [20:20:08] interesting... [20:20:14] koi: I suggest waiting ~48 hours [20:20:29] if it's still broken then, please let me know and we can investigate further [20:21:13] urbanecm, got it, hope the logo will get changed soon [20:21:18] let's see :) [20:21:30] it does work on my end, so that indicates it's not a server-side problem [20:21:36] (03CR) 10Razzi: [V: 03+1] dbproxy: add clouddb sections to conftool (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/779926 (https://phabricator.wikimedia.org/T304478) (owner: 10Razzi) [20:23:27] !log razzi@cumin1001 START - Cookbook sre.hosts.reimage for host clouddb1016.eqiad.wmnet with OS bullseye [20:23:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:26:47] (03CR) 10Ssingh: "PCC error on dns1001 results from a parameter mismatch:" [puppet] - 10https://gerrit.wikimedia.org/r/779936 (owner: 10Ssingh) [20:26:53] PROBLEM - haproxy failover on dbproxy1018 is CRITICAL: CRITICAL check_failover servers up 14 down 2: https://wikitech.wikimedia.org/wiki/HAProxy [20:30:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T298565)', diff saved to https://phabricator.wikimedia.org/P24615 and previous config saved to /var/cache/conftool/dbconfig/20220413-203030-ladsgroup.json [20:30:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:30:38] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [20:31:23] 10SRE-OnFire, 10Wikidata, 
10wdwb-tech, 10Discovery-Search (Current work), and 3 others: Only generate maxlag from pooled query service servers. - https://phabricator.wikimedia.org/T238751 (10Ladsgroup) Do we really need this now that everything is on flink and fancy? [20:32:46] (03PS4) 10Ssingh: dnsrecursor: refactor module (see detailed commit message) [puppet] - 10https://gerrit.wikimedia.org/r/779936 [20:34:36] 10SRE, 10SRE-Access-Requests, 10Infrastructure-Foundations, 10serviceops, 10Release-Engineering-Team (Radar): Need a service account on deploy servers - https://phabricator.wikimedia.org/T303857 (10dancy) [20:34:47] !log razzi@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on clouddb1016.eqiad.wmnet with reason: host reimage [20:34:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:35:32] 10SRE, 10SRE-Access-Requests, 10Infrastructure-Foundations, 10serviceops, 10Release-Engineering-Team (Radar): Need a service account on deploy servers for automated train pre-sync operations - https://phabricator.wikimedia.org/T303857 (10dancy) [20:36:17] !log razzi@cumin1001 END (ERROR) - Cookbook sre.hosts.downtime (exit_code=97) for 2:00:00 on clouddb1016.eqiad.wmnet with reason: host reimage [20:36:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:02] urbanecm: It looks like the QuickSurveys extension is not enabled on enwiki beta. I couldn't get any of the configured surveys to load (even those that were there already), so I checked Special:Version and it's not there. I found where it's enabled on some wikis in InitialiseSettings.php with wmgUseQuickSurveys, but I couldn't find that set in [20:41:03] InitialiseSettings-labs.php. I verified that it is working on eswiki, which has $wmgUseQuickSurveys set to true in InitialiseSettings.php. To enable this in enwiki beta (but not prod), would I add wmgUseQuickSurveys to InitialiseSettings-labs.php and just set it true for enwiki? 
[20:41:37] verified it was working on *eswiki beta* [20:44:40] JSherman: yes. Just adding it to is-labs should do the trick. [20:44:53] (03PS1) 10Andrew Bogott: Revert "OpenStack nova: change log level to 'debug'" [puppet] - 10https://gerrit.wikimedia.org/r/779939 [20:45:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P24616 and previous config saved to /var/cache/conftool/dbconfig/20220413-204535-ladsgroup.json [20:45:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:46:10] (03CR) 10Andrew Bogott: [C: 03+2] Revert "OpenStack nova: change log level to 'debug'" [puppet] - 10https://gerrit.wikimedia.org/r/779939 (owner: 10Andrew Bogott) [20:46:43] (03PS5) 10Ssingh: dnsrecursor: refactor module (see detailed commit message) [puppet] - 10https://gerrit.wikimedia.org/r/779936 [20:49:10] (03PS1) 10Jsn.sherman: Enable QuickSurveys on enwiki beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/779940 (https://phabricator.wikimedia.org/T294363) [20:51:24] urbanecm: mmk, I worked up a change for that; I just added wmgUseQuickSurveys to IS-labs with enwiki => as the only setting inside [20:51:25] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/779940 [20:51:49] *enwiki => true* [20:52:18] !log razzi@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host clouddb1016.eqiad.wmnet with OS bullseye [20:52:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:52:21] 10SRE, 10Performance-Team, 10Traffic, 10serviceops: Potential navtiming_responseStart regression as of 13 Mar 2022 - https://phabricator.wikimedia.org/T303782 (10Krinkle) 05Open→03Resolved There seems to be an upward trend that is continuing having possibly added around ~25ms (5% of 500ms) on both the... 
[20:53:03] Great :) [20:53:51] (03PS1) 10Razzi: wikireplicas: depool clouddb1017-1020 and repool 15 and 16 [puppet] - 10https://gerrit.wikimedia.org/r/779941 (https://phabricator.wikimedia.org/T304478) [20:55:12] (03CR) 10Razzi: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34831/console" [puppet] - 10https://gerrit.wikimedia.org/r/779941 (https://phabricator.wikimedia.org/T304478) (owner: 10Razzi) [20:56:17] urbanecm: is it possible to also deploy 779940 as well, or do I need to schedule for another day? [20:56:27] let's do it [20:56:32] (today) [20:56:37] Ok! [20:56:45] (03CR) 10Urbanecm: [C: 03+2] Enable QuickSurveys on enwiki beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/779940 (https://phabricator.wikimedia.org/T294363) (owner: 10Jsn.sherman) [20:56:47] let's see :) [20:57:25] RECOVERY - haproxy failover on dbproxy1018 is OK: OK check_failover servers up 16 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [20:57:36] (03Merged) 10jenkins-bot: Enable QuickSurveys on enwiki beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/779940 (https://phabricator.wikimedia.org/T294363) (owner: 10Jsn.sherman) [21:00:13] (03CR) 10Razzi: [V: 03+1 C: 03+2] wikireplicas: depool clouddb1017-1020 and repool 15 and 16 [puppet] - 10https://gerrit.wikimedia.org/r/779941 (https://phabricator.wikimedia.org/T304478) (owner: 10Razzi) [21:00:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P24617 and previous config saved to /var/cache/conftool/dbconfig/20220413-210041-ladsgroup.json [21:00:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:01:30] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:01:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:01:33] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE 
helmfile.d/services/mwdebug: apply [21:01:34] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:01:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:01:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:01:38] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:01:38] (03PS2) 10Razzi: wikireplicas: depool clouddb1017-1020 and repool 15 and 16 [puppet] - 10https://gerrit.wikimedia.org/r/779941 (https://phabricator.wikimedia.org/T304478) [21:01:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:02:14] (03PS1) 10Krinkle: static: Remove `/static/current` symlink [mediawiki-config] - 10https://gerrit.wikimedia.org/r/779944 (https://phabricator.wikimedia.org/T302465) [21:02:44] (03Abandoned) 10Razzi: wikireplicas: depool clouddb1017-1020 and repool 15 and 16 [puppet] - 10https://gerrit.wikimedia.org/r/779941 (https://phabricator.wikimedia.org/T304478) (owner: 10Razzi) [21:02:56] JSherman: sorry, got distracted. should be auto-deployed soon(ish) to beta [21:02:57] (as bfore :)) [21:03:15] (03CR) 10Krinkle: "Health checks were the last remaining reference, which has been removed/updated in Puppet with I3cd083bcadfa75da40." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/779944 (https://phabricator.wikimedia.org/T302465) (owner: 10Krinkle) [21:03:16] *before [21:03:25] (03PS1) 10Razzi: wikireplicas: depool clouddb1017-1020 and repool 15 and 16 [puppet] - 10https://gerrit.wikimedia.org/r/779945 (https://phabricator.wikimedia.org/T304478) [21:03:48] urbanecm: No worries, thanks for the bonus deploy! [21:03:53] happy to help! 
[21:04:59] (03CR) 10Razzi: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34833/console" [puppet] - 10https://gerrit.wikimedia.org/r/779945 (https://phabricator.wikimedia.org/T304478) (owner: 10Razzi) [21:06:27] urbanecm: I verified all 8 surveys are now up and running on enwiki beta; thank you 1 000 000! [21:06:34] great! [21:08:29] (03CR) 10Razzi: [V: 03+1 C: 03+2] wikireplicas: depool clouddb1017-1020 and repool 15 and 16 [puppet] - 10https://gerrit.wikimedia.org/r/779945 (https://phabricator.wikimedia.org/T304478) (owner: 10Razzi) [21:10:00] (03PS1) 10Ladsgroup: MigrateLinksTable: Avoid dynamic loading of list columns to select [core] (wmf/1.39.0-wmf.7) - 10https://gerrit.wikimedia.org/r/779877 (https://phabricator.wikimedia.org/T299424) [21:10:07] jouncebot: nowandnext [21:10:07] No deployments scheduled for the next 8 hour(s) and 49 minute(s) [21:10:07] In 8 hour(s) and 49 minute(s): Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220414T0600) [21:10:13] nice [21:10:25] (03CR) 10Ladsgroup: [C: 03+2] MigrateLinksTable: Avoid dynamic loading of list columns to select [core] (wmf/1.39.0-wmf.7) - 10https://gerrit.wikimedia.org/r/779877 (https://phabricator.wikimedia.org/T299424) (owner: 10Ladsgroup) [21:15:04] (03PS1) 10Ladsgroup: admin: Fix Tran's real name [puppet] - 10https://gerrit.wikimedia.org/r/779947 [21:15:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T298565)', diff saved to https://phabricator.wikimedia.org/P24618 and previous config saved to /var/cache/conftool/dbconfig/20220413-211546-ladsgroup.json [21:15:48] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1133.eqiad.wmnet with reason: Maintenance [21:15:49] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1133.eqiad.wmnet with reason: Maintenance [21:15:50] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:15:52] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [21:15:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:15:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:16:17] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] admin: Fix Tran's real name [puppet] - 10https://gerrit.wikimedia.org/r/779947 (owner: 10Ladsgroup) [21:16:45] !log razzi@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on clouddb1017.eqiad.wmnet with reason: Upgrade to bullseye [21:16:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:16:47] !log razzi@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1017.eqiad.wmnet with reason: Upgrade to bullseye [21:16:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:17:22] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Infrastructure-Foundations, 10netops: (Need By: TBD) rack/setup/install new mr1-ulsfo - https://phabricator.wikimedia.org/T294314 (10RobH) [21:18:19] !log razzi@cumin1001 START - Cookbook sre.hosts.reimage for host clouddb1017.eqiad.wmnet with OS bullseye [21:18:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:21:29] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Infrastructure-Foundations, 10netops: (Need By: TBD) rack/setup/install new mr1-ulsfo - https://phabricator.wikimedia.org/T294314 (10RobH) a:05RobH→03ayounsi Arzhel, When we set this up, I recall you saying you didn't want to move the connections in netbox, and wanted... 
[21:22:09] PROBLEM - haproxy failover on dbproxy1019 is CRITICAL: CRITICAL check_failover servers up 14 down 2: https://wikitech.wikimedia.org/wiki/HAProxy [21:29:07] (03Merged) 10jenkins-bot: MigrateLinksTable: Avoid dynamic loading of list columns to select [core] (wmf/1.39.0-wmf.7) - 10https://gerrit.wikimedia.org/r/779877 (https://phabricator.wikimedia.org/T299424) (owner: 10Ladsgroup) [21:29:48] !log razzi@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on clouddb1017.eqiad.wmnet with reason: host reimage [21:29:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:30:39] !log ladsgroup@deploy1002 Synchronized php-1.39.0-wmf.7/maintenance/migrateLinksTable.php: Backport: [[gerrit:779877|MigrateLinksTable: Avoid dynamic loading of list columns to select (T299424)]] (duration: 00m 55s) [21:30:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:30:43] T299424: Run maintenance script backfilling tl_title_id - https://phabricator.wikimedia.org/T299424 [21:32:49] !log razzi@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on clouddb1017.eqiad.wmnet with reason: host reimage [21:32:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:37:05] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:37:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:37:07] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:37:09] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:37:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:37:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:37:13] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:37:14] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:41:01] !log razzi@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on clouddb1018.eqiad.wmnet with reason: Upgrade to bullseye [21:41:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:41:03] !log razzi@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1018.eqiad.wmnet with reason: Upgrade to bullseye [21:41:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:42:39] !log razzi@cumin1001 START - Cookbook sre.hosts.reimage for host clouddb1018.eqiad.wmnet with OS bullseye [21:42:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:47:21] !log razzi@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on clouddb1019.eqiad.wmnet with reason: Upgrade to bullseye [21:47:23] !log razzi@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1019.eqiad.wmnet with reason: Upgrade to bullseye [21:47:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:47:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:47:53] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10tox-wikimedia, and 2 others: Introduce Python code formatters usage - https://phabricator.wikimedia.org/T211750 (10Volans) >>! In T211750#7853334, @jhathaway wrote: > Our we ready to consider running black on our puppet repo? I'm not sure, personally I t... 
[21:47:59] !log razzi@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host clouddb1017.eqiad.wmnet with OS bullseye [21:48:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:48:19] PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test [21:48:54] !log razzi@cumin1001 START - Cookbook sre.hosts.reimage for host clouddb1019.eqiad.wmnet with OS bullseye [21:48:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:50:39] (03CR) 10Ahmon Dancy: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/779944 (https://phabricator.wikimedia.org/T302465) (owner: 10Krinkle) [21:51:11] !log razzi@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on clouddb1020.eqiad.wmnet with reason: Upgrade to bullseye [21:51:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:51:14] !log razzi@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1020.eqiad.wmnet with reason: Upgrade to bullseye [21:51:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:51:17] 10SRE, 10ops-codfw, 10Data-Persistence (Consultation): codfw: Dedicate Rack B1 for cloudX-dev servers - https://phabricator.wikimedia.org/T305469 (10Papaul) @akosiaris thanks will move them tomorrow. 
[21:53:55] !log razzi@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on clouddb1018.eqiad.wmnet with reason: host reimage [21:53:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:54:12] !log razzi@cumin1001 START - Cookbook sre.hosts.reimage for host clouddb1020.eqiad.wmnet with OS bullseye [21:54:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:57:22] !log razzi@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on clouddb1018.eqiad.wmnet with reason: host reimage [21:57:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:57:54] 10SRE-swift-storage, 10UploadWizard, 10Unstewarded-production-error, 10Wikimedia-production-error: "Could not store upload in the stash (UploadStashFileException)" for 2.4 GiB TIF file - https://phabricator.wikimedia.org/T285341 (10Krinkle) 05Open→03Resolved a:03Krinkle Likedly caused by 10SRE, 10ops-codfw, 10Data-Persistence (Consultation): codfw: Dedicate Rack B1 for cloudX-dev servers - https://phabricator.wikimedia.org/T305469 (10Papaul) [21:59:30] 10SRE-swift-storage: Test Commons doesn't show any images - https://phabricator.wikimedia.org/T306139 (10Ladsgroup) [21:59:48] !log razzi@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on clouddb1019.eqiad.wmnet with reason: host reimage [21:59:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:01:48] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1132.eqiad.wmnet with reason: Maintenance [22:01:49] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1132.eqiad.wmnet with reason: Maintenance [22:01:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:01:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:03:13] !log razzi@cumin1001 END (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 2:00:00 on clouddb1019.eqiad.wmnet with reason: host reimage [22:03:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:04:20] !log razzi@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on clouddb1021.eqiad.wmnet with reason: Upgrade to bullseye [22:04:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:04:22] !log razzi@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1021.eqiad.wmnet with reason: Upgrade to bullseye [22:04:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:05:18] !log razzi@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on clouddb1020.eqiad.wmnet with reason: host reimage [22:05:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:06:31] !log razzi@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on clouddb1021.eqiad.wmnet with reason: Upgrade to bullseye [22:06:33] !log razzi@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1021.eqiad.wmnet with reason: Upgrade to bullseye [22:06:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:06:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:06:42] RECOVERY - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is OK: 1 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test [22:07:00] !log razzi@cumin1001 START - Cookbook sre.hosts.reimage for host clouddb1021.eqiad.wmnet with OS bullseye [22:07:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:08:44] !log razzi@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on clouddb1020.eqiad.wmnet with reason: host reimage [22:08:46] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:11:19] 10SRE, 10ops-codfw, 10Data-Persistence (Consultation): codfw: Dedicate Rack B1 for cloudX-dev servers - https://phabricator.wikimedia.org/T305469 (10Papaul) [22:11:57] RECOVERY - SSH on wtp1045.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:12:45] !log razzi@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host clouddb1018.eqiad.wmnet with OS bullseye [22:12:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:14:45] (JobUnavailable) firing: (2) Reduced availability for job calico-felix in k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:15:38] !log razzi@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host clouddb1019.eqiad.wmnet with OS bullseye [22:15:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:19:45] (JobUnavailable) firing: (2) Reduced availability for job calico-felix in k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:23:53] RECOVERY - haproxy failover on dbproxy1019 is OK: OK check_failover servers up 16 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [22:23:59] 10SRE, 10ops-codfw, 10Data-Persistence (Consultation): codfw: Dedicate Rack B1 for cloudX-dev servers - https://phabricator.wikimedia.org/T305469 (10Papaul) @Andrew @aborrero I have listed 14 servers that we will have to move into rack b1 4 of those are not in row B and using Public IP. I think will be bette... 
[22:24:10] 10SRE, 10ops-codfw, 10Data-Persistence (Consultation): codfw: Dedicate Rack B1 for cloudX-dev servers - https://phabricator.wikimedia.org/T305469 (10Papaul) [22:25:33] 10SRE, 10ops-codfw: codfw: Dedicate Rack B1 for cloudX-dev servers - https://phabricator.wikimedia.org/T305469 (10Papaul) [22:30:32] !log razzi@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host clouddb1020.eqiad.wmnet with OS bullseye [22:30:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:31:10] !log razzi@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host clouddb1021.eqiad.wmnet with OS bullseye [22:31:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:32:54] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [22:47:42] (03CR) 10Krinkle: Add "db-mainstash" entry to $wgObjectCaches (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/752807 (https://phabricator.wikimedia.org/T212129) (owner: 10Aaron Schulz) [22:56:05] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance [22:56:07] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance [22:56:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:56:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:56:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3311 (T298565)', diff saved to https://phabricator.wikimedia.org/P24620 and previous config saved to /var/cache/conftool/dbconfig/20220413-225612-ladsgroup.json [22:56:15] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:56:15] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [22:56:58] 10SRE, 10Thumbor, 10serviceops, 10Patch-For-Review, and 2 others: Run latest Thumbor on Docker with Buster + Python 3 - https://phabricator.wikimedia.org/T267327 (10Krinkle) [23:01:54] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [23:31:41] PROBLEM - MariaDB Replica IO: s2 on db2101 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2026, Errmsg: error reconnecting to master repl@db2104.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: SSL connection error00000000:lib(0):func(0):reason(0) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [23:31:55] PROBLEM - MariaDB Replica IO: x1 on db2101 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2026, Errmsg: error reconnecting to master repl@db2096.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: SSL connection error00000000:lib(0):func(0):reason(0) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [23:32:07] PROBLEM - MariaDB Replica IO: s5 on db2101 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2026, Errmsg: error reconnecting to master repl@db2123.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: SSL connection error00000000:lib(0):func(0):reason(0) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [23:36:09] RECOVERY - MariaDB 
Replica IO: s2 on db2101 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [23:36:21] RECOVERY - MariaDB Replica IO: x1 on db2101 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [23:36:33] RECOVERY - MariaDB Replica IO: s5 on db2101 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [23:52:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311 (T298565)', diff saved to https://phabricator.wikimedia.org/P24622 and previous config saved to /var/cache/conftool/dbconfig/20220413-235235-ladsgroup.json [23:52:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:52:39] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565