[00:00:09] RECOVERY - Check systemd state on ml-staging-ctrl2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:09:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', diff saved to https://phabricator.wikimedia.org/P26649 and previous config saved to /var/cache/conftool/dbconfig/20220427-000927-ladsgroup.json [00:09:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:14:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T298554)', diff saved to https://phabricator.wikimedia.org/P26650 and previous config saved to /var/cache/conftool/dbconfig/20220427-001438-ladsgroup.json [00:14:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:14:44] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554 [00:24:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311 (T306560)', diff saved to https://phabricator.wikimedia.org/P26651 and previous config saved to /var/cache/conftool/dbconfig/20220427-002432-ladsgroup.json [00:24:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:24:39] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [00:29:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P26652 and previous config saved to /var/cache/conftool/dbconfig/20220427-002943-ladsgroup.json [00:29:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:40:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T298556)', diff saved to https://phabricator.wikimedia.org/P26653 and previous config saved to 
/var/cache/conftool/dbconfig/20220427-004015-ladsgroup.json [00:40:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:40:22] T298556: Fix mismatching field type of oldimage.oi_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298556 [00:44:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P26654 and previous config saved to /var/cache/conftool/dbconfig/20220427-004448-ladsgroup.json [00:44:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:55:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P26655 and previous config saved to /var/cache/conftool/dbconfig/20220427-005520-ladsgroup.json [00:55:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:59:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T298554)', diff saved to https://phabricator.wikimedia.org/P26656 and previous config saved to /var/cache/conftool/dbconfig/20220427-005953-ladsgroup.json [00:59:55] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance [00:59:56] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance [00:59:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:00:00] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554 [01:00:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1096:3315 (T298554)', diff saved to https://phabricator.wikimedia.org/P26657 and previous config saved to /var/cache/conftool/dbconfig/20220427-010001-ladsgroup.json [01:00:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[01:00:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:00:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:10:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P26658 and previous config saved to /var/cache/conftool/dbconfig/20220427-011025-ladsgroup.json [01:10:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:25:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T298556)', diff saved to https://phabricator.wikimedia.org/P26659 and previous config saved to /var/cache/conftool/dbconfig/20220427-012530-ladsgroup.json [01:25:32] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1098.eqiad.wmnet with reason: Maintenance [01:25:33] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1098.eqiad.wmnet with reason: Maintenance [01:25:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:25:38] T298556: Fix mismatching field type of oldimage.oi_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298556 [01:25:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3316 (T298556)', diff saved to https://phabricator.wikimedia.org/P26660 and previous config saved to /var/cache/conftool/dbconfig/20220427-012538-ladsgroup.json [01:25:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:25:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:25:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:28:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 (T298556)', diff saved to https://phabricator.wikimedia.org/P26661 and previous config saved to /var/cache/conftool/dbconfig/20220427-012850-ladsgroup.json 
[01:28:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:35:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315 (T298554)', diff saved to https://phabricator.wikimedia.org/P26662 and previous config saved to /var/cache/conftool/dbconfig/20220427-013530-ladsgroup.json [01:35:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:35:37] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554 [01:38:45] PROBLEM - Varnish traffic drop between 30min ago and now at esams on alert1001 is CRITICAL: 48.37 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [01:39:15] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is CRITICAL: 57.06 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [01:39:45] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:41:05] RECOVERY - Varnish traffic drop between 30min ago and now at esams on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [01:41:33] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is OK: (C)60 le (W)70 le 90.92 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [01:43:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P26663 and previous config saved to /var/cache/conftool/dbconfig/20220427-014355-ladsgroup.json [01:44:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:44:25] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [01:44:25] PROBLEM - Router interfaces on cr1-drmrs is CRITICAL: CRITICAL: host 185.15.58.128, interfaces up: 57, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [01:45:05] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:49:45] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:50:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315', diff saved to https://phabricator.wikimedia.org/P26664 and previous config saved to /var/cache/conftool/dbconfig/20220427-015035-ladsgroup.json [01:50:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:59:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after 
maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P26665 and previous config saved to /var/cache/conftool/dbconfig/20220427-015900-ladsgroup.json [01:59:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:05:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315', diff saved to https://phabricator.wikimedia.org/P26666 and previous config saved to /var/cache/conftool/dbconfig/20220427-020540-ladsgroup.json [02:05:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:14:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 (T298556)', diff saved to https://phabricator.wikimedia.org/P26667 and previous config saved to /var/cache/conftool/dbconfig/20220427-021405-ladsgroup.json [02:14:07] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1140.eqiad.wmnet with reason: Maintenance [02:14:08] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1140.eqiad.wmnet with reason: Maintenance [02:14:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:14:12] T298556: Fix mismatching field type of oldimage.oi_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298556 [02:14:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:14:18] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db2129.codfw.wmnet with reason: Maintenance [02:14:19] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db2129.codfw.wmnet with reason: Maintenance [02:14:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:14:21] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 20:00:00 on 8 hosts with reason: Maintenance [02:14:24] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:14:26] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 20:00:00 on 8 hosts with reason: Maintenance [02:14:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:14:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:14:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:14:36] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [02:14:38] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [02:14:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:14:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:14:44] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1096.eqiad.wmnet with reason: Maintenance [02:14:46] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1096.eqiad.wmnet with reason: Maintenance [02:14:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:14:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1096:3316 (T298556)', diff saved to https://phabricator.wikimedia.org/P26668 and previous config saved to /var/cache/conftool/dbconfig/20220427-021450-ladsgroup.json [02:14:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:14:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:17:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316 (T298556)', diff saved to https://phabricator.wikimedia.org/P26669 and previous config saved to /var/cache/conftool/dbconfig/20220427-021702-ladsgroup.json [02:17:07] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:20:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315 (T298554)', diff saved to https://phabricator.wikimedia.org/P26670 and previous config saved to /var/cache/conftool/dbconfig/20220427-022045-ladsgroup.json [02:20:47] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1130.eqiad.wmnet with reason: Maintenance [02:20:49] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1130.eqiad.wmnet with reason: Maintenance [02:20:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:20:52] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554 [02:20:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1130 (T298554)', diff saved to https://phabricator.wikimedia.org/P26671 and previous config saved to /var/cache/conftool/dbconfig/20220427-022053-ladsgroup.json [02:20:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:20:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:21:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:32:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316', diff saved to https://phabricator.wikimedia.org/P26672 and previous config saved to /var/cache/conftool/dbconfig/20220427-023207-ladsgroup.json [02:32:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:37:55] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [02:44:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after 
maintenance db1130 (T298554)', diff saved to https://phabricator.wikimedia.org/P26673 and previous config saved to /var/cache/conftool/dbconfig/20220427-024409-ladsgroup.json [02:44:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:44:16] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554 [02:47:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316', diff saved to https://phabricator.wikimedia.org/P26674 and previous config saved to /var/cache/conftool/dbconfig/20220427-024712-ladsgroup.json [02:47:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:01:55] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [03:02:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316 (T298556)', diff saved to https://phabricator.wikimedia.org/P26675 and previous config saved to /var/cache/conftool/dbconfig/20220427-030217-ladsgroup.json [03:02:19] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1131.eqiad.wmnet with reason: Maintenance [03:02:20] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1131.eqiad.wmnet with reason: Maintenance [03:02:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:02:24] T298556: Fix mismatching field type of oldimage.oi_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298556 [03:02:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1131 (T298556)', diff saved to https://phabricator.wikimedia.org/P26676 and previous config saved to /var/cache/conftool/dbconfig/20220427-030225-ladsgroup.json 
[03:02:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:02:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:02:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:04:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131 (T298556)', diff saved to https://phabricator.wikimedia.org/P26677 and previous config saved to /var/cache/conftool/dbconfig/20220427-030433-ladsgroup.json [03:04:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:19:29] PROBLEM - SSH on wtp1037.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:19:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P26678 and previous config saved to /var/cache/conftool/dbconfig/20220427-031938-ladsgroup.json [03:19:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:22:45] (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [03:25:42] (PS1) KartikMistry: Enable Section Translation for Basque Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/786446 (https://phabricator.wikimedia.org/T304862) [03:29:25] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [03:29:27] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [03:29:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:29:34] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:34:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P26679 and previous config saved to /var/cache/conftool/dbconfig/20220427-033443-ladsgroup.json [03:34:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:47:37] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is CRITICAL: 35.69 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [03:48:39] PROBLEM - Varnish traffic drop between 30min ago and now at esams on alert1001 is CRITICAL: 30.86 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [03:49:10] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is CRITICAL: 40.31 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [03:49:15] PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on alert1001 is CRITICAL: 29.61 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [03:49:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131 (T298556)', diff saved to https://phabricator.wikimedia.org/P26680 and previous config saved to /var/cache/conftool/dbconfig/20220427-034948-ladsgroup.json [03:49:50] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1165.eqiad.wmnet with reason: Maintenance [03:49:51] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1165.eqiad.wmnet with reason: Maintenance 
[03:49:53] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 20:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [03:49:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:49:55] T298556: Fix mismatching field type of oldimage.oi_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298556 [03:49:56] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 20:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [03:49:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:50:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1165 (T298556)', diff saved to https://phabricator.wikimedia.org/P26681 and previous config saved to /var/cache/conftool/dbconfig/20220427-035001-ladsgroup.json [03:50:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:50:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:50:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:50:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:50:54] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [03:51:25] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is OK: (C)60 le (W)70 le 72.05 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [03:51:31] RECOVERY - Varnish traffic drop between 30min ago and now at ulsfo on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [03:52:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T298556)', diff saved to https://phabricator.wikimedia.org/P26682 and previous config saved to /var/cache/conftool/dbconfig/20220427-035208-ladsgroup.json [03:52:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:53:11] RECOVERY - Varnish traffic drop between 30min ago and now at esams on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [03:54:25] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is OK: (C)60 le (W)70 le 91.88 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [03:59:54] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2123.codfw.wmnet with reason: Maintenance [03:59:56] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2123.codfw.wmnet with reason: Maintenance [03:59:57] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 8 hosts with reason: Maintenance [03:59:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:00:03] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 8 hosts with reason: Maintenance [04:00:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:00:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:00:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:07:15] !log ladsgroup@cumin1001 dbctl commit 
(dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P26683 and previous config saved to /var/cache/conftool/dbconfig/20220427-040714-ladsgroup.json [04:07:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:22:06] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance [04:22:07] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance [04:22:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:22:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:22:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P26684 and previous config saved to /var/cache/conftool/dbconfig/20220427-042219-ladsgroup.json [04:22:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:37:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T298556)', diff saved to https://phabricator.wikimedia.org/P26685 and previous config saved to /var/cache/conftool/dbconfig/20220427-043724-ladsgroup.json [04:37:26] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1168.eqiad.wmnet with reason: Maintenance [04:37:28] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1168.eqiad.wmnet with reason: Maintenance [04:37:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:37:32] T298556: Fix mismatching field type of oldimage.oi_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298556 [04:37:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1168 (T298556)', diff saved to https://phabricator.wikimedia.org/P26686 and previous config saved to 
/var/cache/conftool/dbconfig/20220427-043732-ladsgroup.json [04:37:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:37:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:37:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:40:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T298556)', diff saved to https://phabricator.wikimedia.org/P26687 and previous config saved to /var/cache/conftool/dbconfig/20220427-044040-ladsgroup.json [04:40:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:47:55] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [04:51:47] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1110.eqiad.wmnet with reason: Maintenance [04:51:49] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1110.eqiad.wmnet with reason: Maintenance [04:51:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:51:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1110 (T298554)', diff saved to https://phabricator.wikimedia.org/P26688 and previous config saved to /var/cache/conftool/dbconfig/20220427-045153-ladsgroup.json [04:51:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:52:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:52:01] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554 [04:55:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P26689 and previous config saved to /var/cache/conftool/dbconfig/20220427-045545-ladsgroup.json [04:55:50] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:10:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P26690 and previous config saved to /var/cache/conftool/dbconfig/20220427-051050-ladsgroup.json [05:10:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:21:57] RECOVERY - SSH on wtp1037.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:25:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T298554)', diff saved to https://phabricator.wikimedia.org/P26691 and previous config saved to /var/cache/conftool/dbconfig/20220427-052549-ladsgroup.json [05:25:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:25:55] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554 [05:25:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T298556)', diff saved to https://phabricator.wikimedia.org/P26692 and previous config saved to /var/cache/conftool/dbconfig/20220427-052555-ladsgroup.json [05:26:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:26:01] T298556: Fix mismatching field type of oldimage.oi_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298556 [05:37:56] (PS1) Marostegui: production.my.cnf.erb: Leave rowid disabled [puppet] - https://gerrit.wikimedia.org/r/786670 (https://phabricator.wikimedia.org/T301879) [05:39:28] (CR) Marostegui: [C: +2] production.my.cnf.erb: Leave rowid disabled [puppet] - https://gerrit.wikimedia.org/r/786670 (https://phabricator.wikimedia.org/T301879) (owner: Marostegui) [05:40:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to 
https://phabricator.wikimedia.org/P26693 and previous config saved to /var/cache/conftool/dbconfig/20220427-054054-ladsgroup.json [05:40:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:55:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P26694 and previous config saved to /var/cache/conftool/dbconfig/20220427-055559-ladsgroup.json [05:56:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:05:21] (03CR) 10Giuseppe Lavagetto: [C: 03+2] requestctl: preserve rules ordering in `requestctl commit` [software/conftool] - 10https://gerrit.wikimedia.org/r/786312 (owner: 10Giuseppe Lavagetto) [06:07:25] (03Merged) 10jenkins-bot: requestctl: preserve rules ordering in `requestctl commit` [software/conftool] - 10https://gerrit.wikimedia.org/r/786312 (owner: 10Giuseppe Lavagetto) [06:11:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T298554)', diff saved to https://phabricator.wikimedia.org/P26695 and previous config saved to /var/cache/conftool/dbconfig/20220427-061104-ladsgroup.json [06:11:06] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance [06:11:07] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance [06:11:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:11:11] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554 [06:11:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3315 (T298554)', diff saved to https://phabricator.wikimedia.org/P26696 and previous config saved to /var/cache/conftool/dbconfig/20220427-061112-ladsgroup.json [06:11:15] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:11:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:11:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:19:47] PROBLEM - SSH on wtp1045.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:32:42] (03CR) 10Giuseppe Lavagetto: [C: 03+2] New version 2.1.3 [software/conftool] - 10https://gerrit.wikimedia.org/r/786313 (owner: 10Giuseppe Lavagetto) [06:35:35] (03Merged) 10jenkins-bot: New version 2.1.3 [software/conftool] - 10https://gerrit.wikimedia.org/r/786313 (owner: 10Giuseppe Lavagetto) [06:37:55] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [06:45:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T298554)', diff saved to https://phabricator.wikimedia.org/P26697 and previous config saved to /var/cache/conftool/dbconfig/20220427-064551-ladsgroup.json [06:45:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:45:58] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554 [06:50:23] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:58:39] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:00:04] Amir1, awight, Urbanecm, and taavi: #bothumor My software never has bugs. It just develops random features. Rise for UTC morning backport window. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220427T0700). [07:00:05] koi: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P26698 and previous config saved to /var/cache/conftool/dbconfig/20220427-070056-ladsgroup.json [07:01:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:01:55] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [07:03:01] hmm, is there anyone [07:11:10] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti-test2001.codfw.wmnet to ganeti-test01.svc.codfw.wmnet [07:11:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:11:15] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti-test2001.codfw.wmnet to ganeti-test01.svc.codfw.wmnet [07:11:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:13:41] 10SRE, 10Wikimedia-Site-requests, 10Chinese-Sites, 10Patch-For-Review: Enable "upload by url" feature at zhwiki - https://phabricator.wikimedia.org/T142991 (10Shizhao) 05Open→03Stalled "Merge Conflict". This task looks like it needs to wait for T44473? >>! In T142991#7872921, @gerritbot wrote: > https://ger...
[07:14:37] 10SRE, 10Infrastructure-Foundations, 10Prod-Kubernetes, 10netops: Agree strategy for Kubernetes BGP peering to top-of-rack switches - https://phabricator.wikimedia.org/T306649 (10elukey) We have discussed this issue in the #serviceops channel yesterday, and the idea is to indeed use labels. The ML clusters... [07:16:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P26699 and previous config saved to /var/cache/conftool/dbconfig/20220427-071601-ladsgroup.json [07:16:05] 10SRE, 10Wikimedia-Site-requests, 10Chinese-Sites, 10Patch-For-Review: Enable "upload by url" feature at zhwiki - https://phabricator.wikimedia.org/T142991 (10Stang) No follow up in T44473 for a pretty long period so... Merge conflict is not a problem as rebase will be performed before merge. [07:16:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:20:47] RECOVERY - SSH on wtp1045.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:22:45] (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [07:24:18] !log installing libxml2 security updates [07:24:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:31:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T298554)', diff saved to https://phabricator.wikimedia.org/P26700 and previous config saved to /var/cache/conftool/dbconfig/20220427-073106-ladsgroup.json [07:31:08] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance [07:31:09] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 
6:00:00 on db1113.eqiad.wmnet with reason: Maintenance [07:31:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:31:14] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554 [07:31:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3315 (T298554)', diff saved to https://phabricator.wikimedia.org/P26701 and previous config saved to /var/cache/conftool/dbconfig/20220427-073114-ladsgroup.json [07:31:15] 10SRE, 10Infrastructure-Foundations, 10Prod-Kubernetes, 10netops: Agree strategy for Kubernetes BGP peering to top-of-rack switches - https://phabricator.wikimedia.org/T306649 (10Joe) >>! In T306649#7883435, @elukey wrote: > This will change the topology of the BGP mesh though: some nodes we'll have multi... [07:31:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:31:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:31:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:31:52] (03PS1) 10Majavah: P:openstack::haproxy: enable built-in prometheus exporter [puppet] - 10https://gerrit.wikimedia.org/r/786783 [07:33:26] Hi taavi are you around? [07:35:56] (03CR) 10Filippo Giunchedi: C:monitoring: Add define for creating http checks (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/786365 (owner: 10Jbond) [07:39:16] koi: not enough to deploy config changes :( [07:39:50] um, nobody here so.. [07:39:59] RECOVERY - Host db1139.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.58 ms [07:42:54] 10SRE-swift-storage, 10Data-Persistence, 10Maps, 10Product-Infrastructure-Team-Backlog, 10Patch-For-Review: Tegola pods are crashing because swift doesnt allow connections - https://phabricator.wikimedia.org/T306424 (10fgiunchedi) >>! In T306424#7868405, @fgiunchedi wrote: >>>! In T306424#7867892, @Jgian... 
[07:44:19] RECOVERY - Host db1140.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.52 ms [07:48:13] (03CR) 10David Caro: [C: 03+1] "Neat, for my future reference:" [puppet] - 10https://gerrit.wikimedia.org/r/786783 (owner: 10Majavah) [07:49:53] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on ganeti-test2001.codfw.wmnet with reason: bullseye update [07:49:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ganeti-test2001.codfw.wmnet with reason: bullseye update [07:49:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:50:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:50:54] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [07:54:30] (03PS3) 10Giuseppe Lavagetto: varnish: switch to using new-style request filters [puppet] - 10https://gerrit.wikimedia.org/r/778544 (https://phabricator.wikimedia.org/T305606) [07:59:29] (03PS1) 10Klausman: admin_ng: Add config for ML staging k8s in codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/786808 (https://phabricator.wikimedia.org/T302195) [07:59:49] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:04:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T298554)', diff saved to https://phabricator.wikimedia.org/P26702 and previous config saved to /var/cache/conftool/dbconfig/20220427-080425-ladsgroup.json [08:04:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:04:32] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - 
https://phabricator.wikimedia.org/T298554 [08:13:55] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/786400 (owner: 10Volans) [08:17:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Increase db1132 weight T301879', diff saved to https://phabricator.wikimedia.org/P26703 and previous config saved to /var/cache/conftool/dbconfig/20220427-081727-marostegui.json [08:17:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:17:33] T301879: Test MariaDB 10.6 on Bullseye - https://phabricator.wikimedia.org/T301879 [08:19:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P26704 and previous config saved to /var/cache/conftool/dbconfig/20220427-081931-ladsgroup.json [08:19:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:22:28] (03PS1) 10Muehlenhoff: ganeti.addnode: Fix up bridge detection for Bullseye changes [cookbooks] - 10https://gerrit.wikimedia.org/r/786828 [08:23:57] (03PS12) 10Thiemo Kreuz (WMDE): Use more compact PHP7 syntax where possible [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737859 [08:25:36] (03CR) 10jerkins-bot: [V: 04-1] ganeti.addnode: Fix up bridge detection for Bullseye changes [cookbooks] - 10https://gerrit.wikimedia.org/r/786828 (owner: 10Muehlenhoff) [08:29:58] (03PS2) 10Muehlenhoff: ganeti.addnode: Fix up bridge detection for Bullseye changes [cookbooks] - 10https://gerrit.wikimedia.org/r/786828 [08:32:52] (03CR) 10jerkins-bot: [V: 04-1] ganeti.addnode: Fix up bridge detection for Bullseye changes [cookbooks] - 10https://gerrit.wikimedia.org/r/786828 (owner: 10Muehlenhoff) [08:33:20] (03PS13) 10Thiemo Kreuz (WMDE): Use more compact PHP7 syntax where possible [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737859 [08:33:32] (03CR) 10Elukey: "This alone doesn't work, we need to update the helmfile.yaml config as well (plus other bits, going 
to check if anything is missing in the" [deployment-charts] - 10https://gerrit.wikimedia.org/r/786808 (https://phabricator.wikimedia.org/T302195) (owner: 10Klausman) [08:34:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P26705 and previous config saved to /var/cache/conftool/dbconfig/20220427-083436-ladsgroup.json [08:34:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:36:10] (03CR) 10Volans: [C: 03+2] homer: suppress cryptography deprecation warning [puppet] - 10https://gerrit.wikimedia.org/r/786400 (owner: 10Volans) [08:36:23] (03PS3) 10Muehlenhoff: ganeti.addnode: Fix up bridge detection for Bullseye changes [cookbooks] - 10https://gerrit.wikimedia.org/r/786828 [08:36:50] (03CR) 10Thiemo Kreuz (WMDE): Use more compact PHP7 syntax where possible (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737859 (owner: 10Thiemo Kreuz (WMDE)) [08:37:29] (03CR) 10Elukey: "There are also the bits related to ml-serve.yaml, some of them should be shared in theory. Not sure how the serviceops staging cluster doe" [deployment-charts] - 10https://gerrit.wikimedia.org/r/786808 (https://phabricator.wikimedia.org/T302195) (owner: 10Klausman) [08:40:19] (03CR) 10jerkins-bot: [V: 04-1] ganeti.addnode: Fix up bridge detection for Bullseye changes [cookbooks] - 10https://gerrit.wikimedia.org/r/786828 (owner: 10Muehlenhoff) [08:45:09] (03CR) 10Klausman: admin_ng: Add config for ML staging k8s in codfw (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/786808 (https://phabricator.wikimedia.org/T302195) (owner: 10Klausman) [08:45:15] (03CR) 10Volans: [C: 04-1] "How often do you envision those aliases to be used? 
I'm not sure if they are worth given the implementation difficulties (see inline for t" [puppet] - 10https://gerrit.wikimedia.org/r/786430 (https://phabricator.wikimedia.org/T306830) (owner: 10Dzahn) [08:47:59] (03CR) 10Elukey: admin_ng: Add config for ML staging k8s in codfw (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/786808 (https://phabricator.wikimedia.org/T302195) (owner: 10Klausman) [08:49:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T298554)', diff saved to https://phabricator.wikimedia.org/P26706 and previous config saved to /var/cache/conftool/dbconfig/20220427-084941-ladsgroup.json [08:49:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:48] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554 [08:50:20] (03CR) 10Volans: ganeti.addnode: Fix up bridge detection for Bullseye changes (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/786828 (owner: 10Muehlenhoff) [08:52:56] (03PS2) 10Klausman: admin_ng: Add config for ML staging k8s in codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/786808 (https://phabricator.wikimedia.org/T302195) [08:53:48] (03CR) 10Muehlenhoff: ganeti.addnode: Fix up bridge detection for Bullseye changes (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/786828 (owner: 10Muehlenhoff) [08:53:58] (03PS4) 10Muehlenhoff: ganeti.addnode: Fix up bridge detection for Bullseye changes [cookbooks] - 10https://gerrit.wikimedia.org/r/786828 [08:54:33] PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:54:36] !log mvernon@cumin1001 START - Cookbook sre.hosts.reimage for host ms-fe1011.eqiad.wmnet with OS bullseye [08:54:40] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - 
https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1001 for host ms-fe1011.eqiad.wmnet with OS bullseye [08:54:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:55:47] (03CR) 10RhinosF1: "note: analytics & data engineering are the same team. I spoke to BTullis and they said they'd upload a patch to move analytics to DE." [puppet] - 10https://gerrit.wikimedia.org/r/786430 (https://phabricator.wikimedia.org/T306830) (owner: 10Dzahn) [08:56:34] (03PS1) 10Btullis: Update role contacts to refelect team name change [puppet] - 10https://gerrit.wikimedia.org/r/786848 (https://phabricator.wikimedia.org/T306830) [08:58:38] the refreshLinks job backlog time has been steadily climbing since Sunday (reaching 1.5 days now), does anyone know if it’s already being looked into? https://grafana-rw.wikimedia.org/d/CbmStnlGk/jobqueue-job?var-job=refreshLinks [08:58:40] (03CR) 10jerkins-bot: [V: 04-1] admin_ng: Add config for ML staging k8s in codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/786808 (https://phabricator.wikimedia.org/T302195) (owner: 10Klausman) [08:59:26] I’m not sure if it’s specific to this job or maybe a general job queue problem – we’re also seeing some strange behavior in e.g. 
InjectRCRecords, but not the same behavior [08:59:49] (03CR) 10Elukey: admin_ng: Add config for ML staging k8s in codfw (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/786808 (https://phabricator.wikimedia.org/T302195) (owner: 10Klausman) [09:01:07] (03PS3) 10Klausman: admin_ng: Add config for ML staging k8s in codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/786808 (https://phabricator.wikimedia.org/T302195) [09:01:21] (03CR) 10Klausman: admin_ng: Add config for ML staging k8s in codfw (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/786808 (https://phabricator.wikimedia.org/T302195) (owner: 10Klausman) [09:01:44] (03CR) 10Hashar: [C: 03+1] docker: move pruning to new profile docker::prune [puppet] - 10https://gerrit.wikimedia.org/r/773641 (https://phabricator.wikimedia.org/T304644) (owner: 10Razzi) [09:04:14] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/786848 (https://phabricator.wikimedia.org/T306830) (owner: 10Btullis) [09:04:30] (03CR) 10Btullis: cumin: add "owner" aliases to get lists of host per SRE subteam (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/786430 (https://phabricator.wikimedia.org/T306830) (owner: 10Dzahn) [09:04:41] (03CR) 10Btullis: [C: 03+2] Update role contacts to refelect team name change [puppet] - 10https://gerrit.wikimedia.org/r/786848 (https://phabricator.wikimedia.org/T306830) (owner: 10Btullis) [09:05:02] (03CR) 10Hashar: [C: 03+1] "Puppet compiler result https://puppet-compiler.wmflabs.org/pcc-worker1002/34966/contint2001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/773784 (owner: 10Hashar) [09:06:01] PROBLEM - Check systemd state on webperf1002 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:08:55] !log mvernon@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on 
ms-fe1011.eqiad.wmnet with reason: host reimage [09:08:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:46] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-fe1011.eqiad.wmnet with reason: host reimage [09:11:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:22:45] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-fe1011.eqiad.wmnet with OS bullseye [09:22:49] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1001 for host ms-fe1011.eqiad.wmnet with OS bullseye completed: - ms-fe1011 (**WARN**) - Downtim... [09:22:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:06] (03PS4) 10Jbond: P:gerrit: Add logoutd script for gerrit [puppet] - 10https://gerrit.wikimedia.org/r/705426 (https://phabricator.wikimedia.org/T286905) [09:29:16] !log repool the hosts we split off on 2022-04-23 as dedicated videoscalers in the jobrunner cluster. The videoscaling load seems to be normal again. 
[09:29:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:29:21] !log akosiaris@cumin1001 conftool action : set/pooled=yes; selector: cluster=jobrunner,name=mw1338.eqiad.wmnet [09:29:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:29:31] !log akosiaris@cumin1001 conftool action : set/pooled=yes; selector: cluster=jobrunner,name=mw1437.eqiad.wmnet [09:29:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:29:38] !log akosiaris@cumin1001 conftool action : set/pooled=yes; selector: cluster=jobrunner,name=mw1438.eqiad.wmnet [09:29:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:29:44] !log akosiaris@cumin1001 conftool action : set/pooled=yes; selector: cluster=jobrunner,name=mw1439.eqiad.wmnet [09:29:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:29:54] !log akosiaris@cumin1001 conftool action : set/pooled=yes; selector: cluster=jobrunner,name=mw1440.eqiad.wmnet [09:29:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:01] !log akosiaris@cumin1001 conftool action : set/pooled=yes; selector: cluster=jobrunner,name=mw1445.eqiad.wmnet [09:30:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:11] !log akosiaris@cumin1001 conftool action : set/pooled=yes; selector: cluster=jobrunner,name=mw1446.eqiad.wmnet [09:30:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:03] !log mvernon@cumin1001 START - Cookbook sre.hosts.reimage for host ms-fe1010.eqiad.wmnet with OS bullseye [09:33:08] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1001 for host ms-fe1010.eqiad.wmnet with OS bullseye [09:33:08] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:10] RECOVERY - Check systemd state on webperf1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:42:09] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] Update to 6.4.6.2 [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/776010 (owner: 10Muehlenhoff) [09:42:45] (03PS1) 10Filippo Giunchedi: clinic-duty: add Arelion to Telia detection [software] - 10https://gerrit.wikimedia.org/r/786893 [09:43:21] (03CR) 10Filippo Giunchedi: "I don't know the Telia/Arelion story, adding folks who might 😊" [software] - 10https://gerrit.wikimedia.org/r/786893 (owner: 10Filippo Giunchedi) [09:44:15] !log rolling upgrade trafficserver to 8.0.8-1wm6 on ulsfo [09:44:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:47:23] !log mvernon@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-fe1010.eqiad.wmnet with reason: host reimage [09:47:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:27] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-fe1010.eqiad.wmnet with reason: host reimage [09:50:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:14] (03PS1) 10Volans: sre.hosts.reimage: run puppet on configmaster.w.o [cookbooks] - 10https://gerrit.wikimedia.org/r/786894 [09:59:20] (03PS1) 10Muehlenhoff: ganeti.addnode: Switch bridge detection to a check based on "ip" [cookbooks] - 10https://gerrit.wikimedia.org/r/786895 [10:03:08] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-fe1010.eqiad.wmnet with OS bullseye [10:03:11] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1001 for host 
ms-fe1010.eqiad.wmnet with OS bullseye completed: - ms-fe1010 (**WARN**) - Downtim... [10:03:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:03:59] (03CR) 10MVernon: [C: 03+1] "LGTM, thank you!" [cookbooks] - 10https://gerrit.wikimedia.org/r/786894 (owner: 10Volans) [10:06:24] (03CR) 10Volans: [C: 03+2] sre.hosts.reimage: run puppet on configmaster.w.o [cookbooks] - 10https://gerrit.wikimedia.org/r/786894 (owner: 10Volans) [10:10:43] (03Merged) 10jenkins-bot: sre.hosts.reimage: run puppet on configmaster.w.o [cookbooks] - 10https://gerrit.wikimedia.org/r/786894 (owner: 10Volans) [10:16:56] (03PS1) 10Muehlenhoff: Update bridge detection code [cookbooks] - 10https://gerrit.wikimedia.org/r/786903 [10:20:57] (03CR) 10jerkins-bot: [V: 04-1] Update bridge detection code [cookbooks] - 10https://gerrit.wikimedia.org/r/786903 (owner: 10Muehlenhoff) [10:22:10] (03PS2) 10Muehlenhoff: Update bridge detection code [cookbooks] - 10https://gerrit.wikimedia.org/r/786903 [10:25:58] !log rolling upgrade trafficserver to 8.0.8-1wm6 on codfw [10:26:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:55] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [10:40:59] !log mvernon@cumin1001 START - Cookbook sre.hosts.reimage for host ms-fe2012.codfw.wmnet with OS bullseye [10:41:03] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1001 for host ms-fe2012.codfw.wmnet with OS bullseye [10:41:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:02] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM!" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/786903 (owner: 10Muehlenhoff) [10:51:53] !log rolling upgrade trafficserver to 8.0.8-1wm6 on eqsin [10:51:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:28] RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:58:43] !log mvernon@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-fe2012.codfw.wmnet with reason: host reimage [10:58:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:55] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [11:02:12] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-fe2012.codfw.wmnet with reason: host reimage [11:02:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:19] PROBLEM - DPKG on deneb is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [11:09:37] (03PS1) 10Kevin Bazira: ml-services: add wikidatawiki & zhwiki editquality isvcs [deployment-charts] - 10https://gerrit.wikimedia.org/r/786924 (https://phabricator.wikimedia.org/T301415) [11:15:43] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-fe2012.codfw.wmnet with OS bullseye [11:15:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:50] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1001 for host ms-fe2012.codfw.wmnet with OS bullseye 
completed: - ms-fe2012 (**WARN**) - Downtim... [11:22:45] (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [11:24:08] !log mvernon@cumin1001 START - Cookbook sre.hosts.reimage for host ms-fe2011.codfw.wmnet with OS bullseye [11:24:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:13] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1001 for host ms-fe2011.codfw.wmnet with OS bullseye [11:34:07] 10SRE, 10Infrastructure-Foundations, 10netbox, 10netops: Netbox Juniper report - https://phabricator.wikimedia.org/T306238 (10jbond) I think the best option is to use OIDC, however that comes with a couple of caveats. 1) We don't currently have OIDC support enabled in CAS so there could be some teething i... [11:35:22] Hi [11:35:41] Is the Linter service a WMF service or one on Labs/Toolforge? 
[11:35:54] I am asking because its updates seem to have stalled [11:41:42] !log mvernon@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-fe2011.codfw.wmnet with reason: host reimage [11:41:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:46] !log jelto@cumin1001 START - Cookbook sre.hosts.reboot-cluster [11:42:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:44:41] (03CR) 10Jbond: WIP move core routers definitions to hiera (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/777347 (https://phabricator.wikimedia.org/T169860) (owner: 10Filippo Giunchedi) [11:44:48] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-fe2011.codfw.wmnet with reason: host reimage [11:44:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:45:52] (03CR) 10Jbond: "thanks" [puppet] - 10https://gerrit.wikimedia.org/r/786365 (owner: 10Jbond) [11:50:54] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [11:58:16] (03CR) 10Muehlenhoff: [C: 03+2] Update bridge detection code [cookbooks] - 10https://gerrit.wikimedia.org/r/786903 (owner: 10Muehlenhoff) [11:58:46] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-fe2011.codfw.wmnet with OS bullseye [11:58:49] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1001 for host ms-fe2011.codfw.wmnet with OS bullseye completed: - ms-fe2011 (**WARN**) - Downtim...
[11:58:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:03:45] (03PS1) 10Ladsgroup: Set actor migration to READ NEW everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/786936 (https://phabricator.wikimedia.org/T275246) [12:04:27] jouncebot: nowandnext [12:04:27] No deployments scheduled for the next 0 hour(s) and 55 minute(s) [12:04:27] In 0 hour(s) and 55 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220427T1300) [12:04:32] awesome [12:04:57] (03CR) 10Ladsgroup: [C: 03+2] Set actor migration to READ NEW everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/786936 (https://phabricator.wikimedia.org/T275246) (owner: 10Ladsgroup) [12:05:03] heads up marostegui ^ [12:05:11] oh cool [12:05:41] (03Merged) 10jenkins-bot: Set actor migration to READ NEW everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/786936 (https://phabricator.wikimedia.org/T275246) (owner: 10Ladsgroup) [12:07:10] !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:786936|Set actor migration to READ NEW everywhere (T275246)]] (duration: 00m 53s) [12:07:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:16] T275246: Populate rev_actor and rev_comment_id - https://phabricator.wikimedia.org/T275246 [12:07:25] !log mvernon@cumin1001 START - Cookbook sre.hosts.reimage for host ms-fe2010.codfw.wmnet with OS bullseye [12:07:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:30] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1001 for host ms-fe2010.codfw.wmnet with OS bullseye [12:07:32] 10SRE, 10Infrastructure-Foundations, 10Prod-Kubernetes, 10netops: Agree strategy for Kubernetes BGP peering to 
top-of-rack switches - https://phabricator.wikimedia.org/T306649 (10cmooney) Thanks for the updates. Sounds like a good plan! In terms of the configuration for the CR "nodeSelector" filter I thi... [12:10:13] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [12:10:16] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [12:10:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:17] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [12:10:21] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [12:10:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:40] I confirm queries have changed to use rev_actor now https://logstash.wikimedia.org/app/discover#/doc/logstash-*/logstash-mediawiki-2022.04.27?id=vEruaoABxNOh7eZ4cF-v [12:14:18] :o [12:14:48] (03PS1) 10Muehlenhoff: Update to 6.4.6.3 [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/786938 [12:15:50] !log rolling upgrade trafficserver to 8.0.8-1wm6 on esams [12:15:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:00] !log mvernon@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-fe2010.codfw.wmnet with reason: host reimage [12:25:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:24] !log rolling upgrade trafficserver to 8.0.8-1wm6 on drmrs [12:25:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:41] (03PS1) 10Jbond: ssh-client-config: ensure cloudcontrol serveres us the correct jumphost [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/786940 [12:28:25] !log mvernon@cumin1001 END (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-fe2010.codfw.wmnet with reason: host reimage [12:28:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:53] (03PS4) 10Klausman: admin_ng: Add config for ML staging k8s in codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/786808 (https://phabricator.wikimedia.org/T302195) [12:30:13] (03PS1) 10Filippo Giunchedi: clinic-duty: add euNetworks support [software] - 10https://gerrit.wikimedia.org/r/786941 [12:36:58] (03CR) 10Vivian Rook: [C: 03+1] wmcs.codfw1: use the correct memcached port for the exporter [puppet] - 10https://gerrit.wikimedia.org/r/786330 (owner: 10David Caro) [12:42:05] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-fe2010.codfw.wmnet with OS bullseye [12:42:09] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1001 for host ms-fe2010.codfw.wmnet with OS bullseye completed: - ms-fe2010 (**WARN**) - Downtim... [12:42:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:15] (03CR) 10Muehlenhoff: ssh-client-config: ensure cloudcontrol serveres us the correct jumphost (032 comments) [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/786940 (owner: 10Jbond) [12:42:49] !log rolling upgrade trafficserver to 8.0.8-1wm6 on eqiad [12:42:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:08] (03PS2) 10Kormat: pc1014: Move to pc2. [puppet] - 10https://gerrit.wikimedia.org/r/786317 (https://phabricator.wikimedia.org/T303174) [12:43:50] (03CR) 10jerkins-bot: [V: 04-1] pc1014: Move to pc2. [puppet] - 10https://gerrit.wikimedia.org/r/786317 (https://phabricator.wikimedia.org/T303174) (owner: 10Kormat) [12:44:44] (03PS3) 10Kormat: pc1014: Move to pc2. 
[puppet] - 10https://gerrit.wikimedia.org/r/786317 (https://phabricator.wikimedia.org/T303174) [12:46:45] (03CR) 10Marostegui: [C: 03+1] pc1014: Move to pc2. [puppet] - 10https://gerrit.wikimedia.org/r/786317 (https://phabricator.wikimedia.org/T303174) (owner: 10Kormat) [12:47:08] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1180.eqiad.wmnet with reason: Maintenance [12:47:09] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1180.eqiad.wmnet with reason: Maintenance [12:47:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1180 (T298556)', diff saved to https://phabricator.wikimedia.org/P26708 and previous config saved to /var/cache/conftool/dbconfig/20220427-124714-ladsgroup.json [12:47:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:32] T298556: Fix mismatching field type of oldimage.oi_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298556 [12:47:47] !log mvernon@cumin1001 START - Cookbook sre.hosts.reimage for host ms-be2040.codfw.wmnet with OS bullseye [12:47:50] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1001 for host ms-be2040.codfw.wmnet with OS bullseye [12:47:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:32] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10MatthewVernon) [12:49:32] (03CR) 10Kormat: [C: 03+2] pc1014: Move to pc2. 
[puppet] - 10https://gerrit.wikimedia.org/r/786317 (https://phabricator.wikimedia.org/T303174) (owner: 10Kormat) [12:50:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T298556)', diff saved to https://phabricator.wikimedia.org/P26709 and previous config saved to /var/cache/conftool/dbconfig/20220427-125022-ladsgroup.json [12:50:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:13] !log moved pc1014 to pc2 T303174 [12:53:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:21] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1122.eqiad.wmnet with reason: Maintenance [12:54:23] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1122.eqiad.wmnet with reason: Maintenance [12:54:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1122 (T306560)', diff saved to https://phabricator.wikimedia.org/P26710 and previous config saved to /var/cache/conftool/dbconfig/20220427-125427-ladsgroup.json [12:54:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:36] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [12:57:09] (03CR) 10Filippo Giunchedi: C:monitoring: Add define for creating http checks (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/786365 (owner: 10Jbond) [12:57:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122 (T306560)', diff saved to https://phabricator.wikimedia.org/P26712 and previous config saved to /var/cache/conftool/dbconfig/20220427-125752-ladsgroup.json [12:58:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:04] RoanKattouw, 
Lucas_WMDE, and Urbanecm: Dear deployers, time to do the UTC afternoon backport window deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220427T1300). [13:00:04] No Gerrit patches in the queue for this window AFAICS. [13:01:09] (03PS1) 10KartikMistry: Enable SectionTranslation in testwiki for Punjabi, Tsonga, Nepali, and Swahili [mediawiki-config] - 10https://gerrit.wikimedia.org/r/786947 (https://phabricator.wikimedia.org/T304828) [13:02:00] (03PS2) 10KartikMistry: Enable Section Translation for Basque Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/786446 (https://phabricator.wikimedia.org/T304862) [13:03:55] (03PS1) 10Jelto: icinga: increase retries and delay for icinga status check [software/spicerack] - 10https://gerrit.wikimedia.org/r/786949 [13:03:57] (03PS2) 10KartikMistry: Enable SectionTranslation in testwiki for Punjabi, Tsonga, Nepali, and Swahili [mediawiki-config] - 10https://gerrit.wikimedia.org/r/786947 (https://phabricator.wikimedia.org/T304828) [13:05:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P26714 and previous config saved to /var/cache/conftool/dbconfig/20220427-130527-ladsgroup.json [13:05:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:37] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-cluster (exit_code=0) [13:05:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:16] (03CR) 10Volans: "question inline" [software/spicerack] - 10https://gerrit.wikimedia.org/r/786949 (owner: 10Jelto) [13:11:09] 10SRE, 10Infrastructure-Foundations, 10Prod-Kubernetes, 10netops: Agree strategy for Kubernetes BGP peering to top-of-rack switches - https://phabricator.wikimedia.org/T306649 (10cmooney) On reflection the above won't work if we're going to add the 'node-location' for all existing hosts, 
which I assume is... [13:11:46] !log mvernon@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-be2040.codfw.wmnet with OS bullseye [13:11:49] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1001 for host ms-be2040.codfw.wmnet with OS bullseye executed with errors: - ms-be2040 (**FAIL**)... [13:11:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:37] (03CR) 10jerkins-bot: [V: 04-1] icinga: increase retries and delay for icinga status check [software/spicerack] - 10https://gerrit.wikimedia.org/r/786949 (owner: 10Jelto) [13:12:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122', diff saved to https://phabricator.wikimedia.org/P26715 and previous config saved to /var/cache/conftool/dbconfig/20220427-131257-ladsgroup.json [13:13:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:03] !log mvernon@cumin1001 START - Cookbook sre.hosts.reimage for host ms-be2040.codfw.wmnet with OS bullseye [13:13:07] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1001 for host ms-be2040.codfw.wmnet with OS bullseye [13:13:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:43] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: remove per-exporter up checks [puppet] - 10https://gerrit.wikimedia.org/r/784636 (https://phabricator.wikimedia.org/T288726) (owner: 10Filippo Giunchedi) [13:19:29] 10SRE, 10DBA, 10Security: Reboot pc1012 - https://phabricator.wikimedia.org/T306983 (10Kormat) [13:19:33] (03CR) 10Elukey: admin_ng: Add config for ML staging k8s in codfw (031 comment) [deployment-charts] - 
10https://gerrit.wikimedia.org/r/786808 (https://phabricator.wikimedia.org/T302195) (owner: 10Klausman) [13:20:03] (03PS1) 10Kormat: ProductionServices: Promote pc1014 to primary of pc2. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/786955 (https://phabricator.wikimedia.org/T306983) [13:20:30] 10SRE, 10DBA, 10Patch-For-Review, 10Security: Reboot pc1012 - https://phabricator.wikimedia.org/T306983 (10Kormat) [13:20:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P26716 and previous config saved to /var/cache/conftool/dbconfig/20220427-132032-ladsgroup.json [13:20:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:40] (03PS5) 10Klausman: admin_ng: Add config for ML staging k8s in codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/786808 (https://phabricator.wikimedia.org/T302195) [13:21:34] (03CR) 10Klausman: admin_ng: Add config for ML staging k8s in codfw (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/786808 (https://phabricator.wikimedia.org/T302195) (owner: 10Klausman) [13:24:09] (03PS1) 10Andrew Bogott: Update recursor IPs for codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/786957 [13:25:42] (03CR) 10Andrew Bogott: [C: 03+2] Update recursor IPs for codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/786957 (owner: 10Andrew Bogott) [13:26:53] 10SRE, 10DBA, 10Patch-For-Review, 10Security: Reboot pc1012 - https://phabricator.wikimedia.org/T306983 (10Kormat) [13:28:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122', diff saved to https://phabricator.wikimedia.org/P26717 and previous config saved to /var/cache/conftool/dbconfig/20220427-132802-ladsgroup.json [13:28:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:53] (03CR) 10Elukey: [C: 03+1] "LGTM! Can you and Tobias coordinate to merge/deploy it?" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/786924 (https://phabricator.wikimedia.org/T301415) (owner: 10Kevin Bazira) [13:35:24] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti-test2001.codfw.wmnet to ganeti-test01.svc.codfw.wmnet [13:35:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T298556)', diff saved to https://phabricator.wikimedia.org/P26718 and previous config saved to /var/cache/conftool/dbconfig/20220427-133537-ladsgroup.json [13:35:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:43] T298556: Fix mismatching field type of oldimage.oi_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298556 [13:35:47] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): Q3:(Need By: TBD) rack/setup/install 7 wmcs hosts - https://phabricator.wikimedia.org/T304881 (10Andrew) [13:36:06] 10SRE, 10ops-codfw, 10DC-Ops, 10cloud-services-team (Kanban): Neutron networking not working for cloudnet200[5,6]-dev.codfw.wmnet - https://phabricator.wikimedia.org/T306861 (10Andrew) 05Open→03Resolved >>! In T306861#7879492, @Papaul wrote: > @Andrew the racking task for the cloudnet nodes said to set... 
[13:36:57] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti-test2001.codfw.wmnet to ganeti-test01.svc.codfw.wmnet [13:36:58] !log mvernon@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-be2040.codfw.wmnet with OS bullseye [13:37:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:02] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1001 for host ms-be2040.codfw.wmnet with OS bullseye executed with errors: - ms-be2040 (**FAIL**)... [13:37:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:54] (03CR) 10Elukey: "Adding Janis for some feedback about how to best add the ml staging cluster. The option that we are evaluating is to keep ml-serve.yaml as" [deployment-charts] - 10https://gerrit.wikimedia.org/r/786808 (https://phabricator.wikimedia.org/T302195) (owner: 10Klausman) [13:38:15] (03PS1) 10David Caro: Move from deprecated icinga_hosts to alerting_hosts [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/786960 (https://phabricator.wikimedia.org/T304533) [13:40:42] (03CR) 10Volans: [C: 04-1] Move from deprecated icinga_hosts to alerting_hosts (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/786960 (https://phabricator.wikimedia.org/T304533) (owner: 10David Caro) [13:40:56] !log herron@cumin1001 START - Cookbook sre.hosts.reboot-single for host prometheus3001.esams.wmnet [13:41:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:54] (03CR) 10jerkins-bot: [V: 04-1] Move from deprecated icinga_hosts to alerting_hosts [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/786960 (https://phabricator.wikimedia.org/T304533) (owner: 10David Caro) [13:43:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling 
after maintenance db1122 (T306560)', diff saved to https://phabricator.wikimedia.org/P26719 and previous config saved to /var/cache/conftool/dbconfig/20220427-134308-ladsgroup.json [13:43:08] (03CR) 10JMeybohm: admin_ng: Add config for ML staging k8s in codfw (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/786808 (https://phabricator.wikimedia.org/T302195) (owner: 10Klausman) [13:43:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:14] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [13:43:58] (03CR) 10Klausman: admin_ng: Add config for ML staging k8s in codfw (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/786808 (https://phabricator.wikimedia.org/T302195) (owner: 10Klausman) [13:45:03] (03CR) 10JMeybohm: [C: 03+2] Update miscweb relates records for use with k8s ingress [dns] - 10https://gerrit.wikimedia.org/r/786322 (https://phabricator.wikimedia.org/T305358) (owner: 10JMeybohm) [13:45:40] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus3001.esams.wmnet [13:45:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:32] !log herron@cumin1001 START - Cookbook sre.hosts.reboot-single for host prometheus5001.eqsin.wmnet [13:46:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:58] !log rebalance ganeti-test after adding new bullseye node T306499 [13:47:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:03] T306499: Upgrade ganeti-test to Bullseye - https://phabricator.wikimedia.org/T306499 [13:47:38] !log jayme@cumin1001 START - Cookbook sre.dns.netbox [13:47:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:57] (03CR) 10Elukey: [C: 03+1] admin_ng: Add config for ML staging k8s in codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/786808 
(https://phabricator.wikimedia.org/T302195) (owner: 10Klausman) [13:51:15] !log jayme@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:51:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:34] (03CR) 10Kevin Bazira: ml-services: add wikidatawiki & zhwiki editquality isvcs (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/786924 (https://phabricator.wikimedia.org/T301415) (owner: 10Kevin Bazira) [13:52:14] (03CR) 10Klausman: [C: 03+2] admin_ng: Add config for ML staging k8s in codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/786808 (https://phabricator.wikimedia.org/T302195) (owner: 10Klausman) [13:52:41] 10SRE-swift-storage, 10ops-codfw: upgrade firmware on ms-be2040 - https://phabricator.wikimedia.org/T306988 (10MatthewVernon) [13:53:16] 10SRE-swift-storage, 10ops-codfw: upgrade firmware on ms-be2040 - https://phabricator.wikimedia.org/T306988 (10MatthewVernon) [13:53:19] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10MatthewVernon) [13:53:33] 10SRE-swift-storage, 10ops-codfw: upgrade firmware on ms-be2040 - https://phabricator.wikimedia.org/T306988 (10MatthewVernon) p:05Triage→03High [13:53:40] (03CR) 10David Caro: Move from deprecated icinga_hosts to alerting_hosts (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/786960 (https://phabricator.wikimedia.org/T304533) (owner: 10David Caro) [13:53:57] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus5001.eqsin.wmnet [13:54:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:23] !log herron@cumin1001 START - Cookbook sre.hosts.reboot-single for host prometheus4001.ulsfo.wmnet [13:54:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:09] (03CR) 10Raymond Ndibe: Create REST api service to manage toolforge 
replica.my.cnf (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [13:55:18] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.3 point update - https://phabricator.wikimedia.org/T304599 (10MoritzMuehlenhoff) [13:57:05] (03Merged) 10jenkins-bot: admin_ng: Add config for ML staging k8s in codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/786808 (https://phabricator.wikimedia.org/T302195) (owner: 10Klausman) [13:59:06] (03CR) 10Klausman: [C: 03+1] ml-services: add wikidatawiki & zhwiki editquality isvcs [deployment-charts] - 10https://gerrit.wikimedia.org/r/786924 (https://phabricator.wikimedia.org/T301415) (owner: 10Kevin Bazira) [13:59:28] 10SRE, 10Beta-Cluster-Infrastructure, 10Traffic, 10Beta-Cluster-reproducible: Beta cluster down: Error: 502, Next Hop Connection Failed (Feb 2022) - https://phabricator.wikimedia.org/T302699 (10dom_walden) Happening again: ` dwalden@deployment-mediawiki12:~$ sudo tail /var/log/apache2.log Apr 27 13:58:00 d... [13:59:51] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus4001.ulsfo.wmnet [13:59:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:13] !log herron@cumin1001 START - Cookbook sre.hosts.reboot-single for host prometheus6001.drmrs.wmnet [14:00:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:36] 10SRE, 10ops-codfw, 10DC-Ops, 10cloud-services-team (Kanban): decommission cloudnet200[2,4]-dev.wikimedia.org - https://phabricator.wikimedia.org/T306989 (10Andrew) [14:03:54] !log klausman@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. 
[14:03:55] (03PS1) 10Andrew Bogott: Prepare cloudnet2002-dev and cloudnet2004-dev for decom [puppet] - 10https://gerrit.wikimedia.org/r/786967 (https://phabricator.wikimedia.org/T306989) [14:03:57] !log klausman@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [14:03:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:23] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10MatthewVernon) We're now blocked on backend reimaging. ms-be2040 fails to PXE-boot (gets as far as `Probing EDD (edd=off to disable)... ok`), so we're going to try a firmware upgr... [14:04:54] (03CR) 10Andrew Bogott: [C: 03+2] Prepare cloudnet2002-dev and cloudnet2004-dev for decom [puppet] - 10https://gerrit.wikimedia.org/r/786967 (https://phabricator.wikimedia.org/T306989) (owner: 10Andrew Bogott) [14:07:16] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus6001.drmrs.wmnet [14:07:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:30] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1173.eqiad.wmnet with reason: Maintenance [14:07:32] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1173.eqiad.wmnet with reason: Maintenance [14:07:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:49] (03CR) 10Klausman: [C: 03+1] "Aside from Luca's nit, LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/786924 (https://phabricator.wikimedia.org/T301415) (owner: 10Kevin Bazira) [14:08:21] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1100.eqiad.wmnet with reason: 
Maintenance [14:08:22] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1100.eqiad.wmnet with reason: Maintenance [14:08:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:58] (03CR) 10Jbond: "there also seems to be a few changes in the Gradle files, these mostly don't need to be applied but we have hit minor issues by not being " [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/786938 (owner: 10Muehlenhoff) [14:09:29] !log herron@cumin1001 START - Cookbook sre.hosts.reboot-single for host thanos-be2002.codfw.wmnet [14:09:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:19] (03PS4) 10Sergio Gimeno: Newcomer tasks: deploy AND topic selection to pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/780874 (https://phabricator.wikimedia.org/T305399) [14:10:58] (03PS5) 10Sergio Gimeno: Newcomer tasks: deploy AND topic selection to pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/780874 (https://phabricator.wikimedia.org/T305399) [14:11:14] (03PS2) 10Kevin Bazira: ml-services: add wikidatawiki & zhwiki editquality isvcs [deployment-charts] - 10https://gerrit.wikimedia.org/r/786924 (https://phabricator.wikimedia.org/T301415) [14:11:42] (03CR) 10JMeybohm: [C: 03+2] Remove miscweb discovery resources [puppet] - 10https://gerrit.wikimedia.org/r/786323 (https://phabricator.wikimedia.org/T305358) (owner: 10JMeybohm) [14:11:58] (03CR) 10Muehlenhoff: Update to 6.4.6.3 (031 comment) [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/786938 (owner: 10Muehlenhoff) [14:12:09] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1138.eqiad.wmnet with reason: Maintenance [14:12:11] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1138.eqiad.wmnet 
with reason: Maintenance [14:12:11] (03CR) 10Kevin Bazira: ml-services: add wikidatawiki & zhwiki editquality isvcs (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/786924 (https://phabricator.wikimedia.org/T301415) (owner: 10Kevin Bazira) [14:12:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1138 (T306560)', diff saved to https://phabricator.wikimedia.org/P26720 and previous config saved to /var/cache/conftool/dbconfig/20220427-141215-ladsgroup.json [14:12:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:24] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [14:13:15] (03CR) 10Sergio Gimeno: Newcomer tasks: deploy AND topic selection to pilot wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/780874 (https://phabricator.wikimedia.org/T305399) (owner: 10Sergio Gimeno) [14:13:27] (03Abandoned) 10Jbond: ssh-client-config: ensure cloudcontrol serveres us the correct jumphost [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/786940 (owner: 10Jbond) [14:15:25] (03CR) 10Hnowlan: "Look fine to me - just so I don't miss anything, are there any significant divergences between this internal version and the upstream tc_c" [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/785127 (https://phabricator.wikimedia.org/T305053) (owner: 10Roman Stolar) [14:15:34] (03PS3) 10Klausman: ml-services: add wikidatawiki & zhwiki editquality isvcs [deployment-charts] - 10https://gerrit.wikimedia.org/r/786924 (https://phabricator.wikimedia.org/T301415) (owner: 10Kevin Bazira) [14:15:43] (03CR) 10Klausman: [C: 03+2] ml-services: add wikidatawiki & zhwiki editquality isvcs [deployment-charts] - 10https://gerrit.wikimedia.org/r/786924 
(https://phabricator.wikimedia.org/T301415) (owner: 10Kevin Bazira) [14:15:53] (03Abandoned) 10Hnowlan: tegola: increase replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/784246 (https://phabricator.wikimedia.org/T306424) (owner: 10Hnowlan) [14:16:02] (03CR) 10Klausman: [V: 03+2 C: 03+2] ml-services: add wikidatawiki & zhwiki editquality isvcs [deployment-charts] - 10https://gerrit.wikimedia.org/r/786924 (https://phabricator.wikimedia.org/T301415) (owner: 10Kevin Bazira) [14:16:18] (03PS4) 10JMeybohm: Move miscweb back to state production [puppet] - 10https://gerrit.wikimedia.org/r/774917 (https://phabricator.wikimedia.org/T290966) [14:18:04] (03CR) 10Hnowlan: [Beta Cluster] LabsServices: Move to buster restbase host (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/766602 (https://phabricator.wikimedia.org/T295375) (owner: 10Hnowlan) [14:19:03] (03PS17) 10Herron: prometheus: enable prometheus web access via proxy with IDP [puppet] - 10https://gerrit.wikimedia.org/r/764895 (https://phabricator.wikimedia.org/T301944) [14:20:01] (03CR) 10Herron: prometheus: enable prometheus web access via proxy with IDP (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/764895 (https://phabricator.wikimedia.org/T301944) (owner: 10Herron) [14:21:24] !log kevinbazira@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . 
[14:21:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:00] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-be2002.codfw.wmnet [14:22:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:14] (03CR) 10David Caro: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/773614 (https://phabricator.wikimedia.org/T304591) (owner: 10Razzi) [14:23:19] !log kevinbazira@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [14:23:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:06] !log kevinbazira@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [14:27:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:23] 10SRE, 10Infrastructure-Foundations, 10LDAP-Access-Requests: Grant Access to ldap/wmf for Nathillard - https://phabricator.wikimedia.org/T305978 (10NHillard-WMF) Hi all -- Apologies for the delay, this notification had gotten lost. Thank you for following up. I re-tested just now and have access to almost a... [14:27:29] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Kanban): decommission cloudnet200[2,4]-dev.wikimedia.org - https://phabricator.wikimedia.org/T306989 (10Papaul) @Andrew this is assigned to me but the "Steps for service owner:" are not checked. [14:30:04] (03CR) 10JMeybohm: [C: 03+2] Move miscweb back to state production [puppet] - 10https://gerrit.wikimedia.org/r/774917 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm) [14:34:31] (03CR) 10Jaime Nuche: "Hi!, I'm gonna need this change soon. Majavah, if you're happy with the latest patchset, can we have this merged?" 
[puppet] - 10https://gerrit.wikimedia.org/r/771441 (https://phabricator.wikimedia.org/T303559) (owner: 10Jbond) [14:35:00] (03CR) 10Jbond: [C: 03+1] Update to 6.4.6.3 (031 comment) [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/786938 (owner: 10Muehlenhoff) [14:35:43] (03PS1) 10Ladsgroup: Enable videojs in eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/786976 (https://phabricator.wikimedia.org/T303785) [14:35:50] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] Update to 6.4.6.3 [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/786938 (owner: 10Muehlenhoff) [14:37:22] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34967/console" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/771441 (https://phabricator.wikimedia.org/T303559) (owner: 10Jbond) [14:37:32] jouncebot: nowandnext [14:37:32] No deployments scheduled for the next 1 hour(s) and 52 minute(s) [14:37:32] In 1 hour(s) and 52 minute(s): Deploy Wikistories to beta cluster (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220427T1630) [14:37:54] 10SRE, 10Infrastructure-Foundations, 10LDAP-Access-Requests: Grant Access to ldap/wmf for Nathillard - https://phabricator.wikimedia.org/T305978 (10Dzahn) Hi @NHillard-WMF great to hear most things work now! Thanks for confirming. Regarding Piwik (renamed to Matomo), I found this: https://wikitech.wikimed... 
[14:37:55] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [14:38:50] (03CR) 10Ladsgroup: [C: 03+2] Enable videojs in eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/786976 (https://phabricator.wikimedia.org/T303785) (owner: 10Ladsgroup) [14:39:28] (03CR) 10Majavah: [V: 03+1 C: 03+1] P:scap::dsh: Add scap targets as a dsh group (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/771441 (https://phabricator.wikimedia.org/T303559) (owner: 10Jbond) [14:39:49] (03Merged) 10jenkins-bot: Enable videojs in eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/786976 (https://phabricator.wikimedia.org/T303785) (owner: 10Ladsgroup) [14:40:59] (03CR) 10Hnowlan: [C: 04-1] "Do we need a codfw entry, a DYNA geoip entry and PTR records for eqiad/codfw in templates/10.in-addr.arpa for the service?" 
[dns] - 10https://gerrit.wikimedia.org/r/786426 (https://phabricator.wikimedia.org/T304891) (owner: 10Dzahn) [14:41:01] (03CR) 10Jbond: ssh-client-config: ensure cloudcontrol serveres us the correct jumphost (031 comment) [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/786940 (owner: 10Jbond) [14:41:30] (03CR) 10Vgutierrez: [C: 03+1] varnish: switch to using new-style request filters [puppet] - 10https://gerrit.wikimedia.org/r/778544 (https://phabricator.wikimedia.org/T305606) (owner: 10Giuseppe Lavagetto) [14:42:40] !log imported cas 6.4.6.3 to apt.wikimedia.org [14:42:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:34] !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:786976|Enable videojs in eswiki (T303785 T248418)]] (duration: 00m 51s) [14:43:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:41] T303785: Account creation: social media landing pages - https://phabricator.wikimedia.org/T303785 [14:43:41] T248418: Roll out videojs as the only video/audio player on all Wikimedia wikis - https://phabricator.wikimedia.org/T248418 [14:44:37] 10SRE, 10serviceops: Migrate node-based services in production to node12 - https://phabricator.wikimedia.org/T290750 (10Jdforrester-WMF) [14:44:46] (03CR) 10Volans: [C: 04-1] "reply inline" [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/786960 (https://phabricator.wikimedia.org/T304533) (owner: 10David Caro) [14:45:02] 10SRE, 10serviceops: Migrate node-based services in production to node14 - https://phabricator.wikimedia.org/T306995 (10Jdforrester-WMF) [14:46:28] (03CR) 10Jbond: P:scap::dsh: Add scap targets as a dsh group (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/771441 (https://phabricator.wikimedia.org/T303559) (owner: 10Jbond) [14:46:35] 10SRE, 10Infrastructure-Foundations, 10LDAP-Access-Requests: Grant Access to ldap/wmf for Nathillard - https://phabricator.wikimedia.org/T305978 
(10NHillard-WMF) Ah OK - I see now that if I go to a sub-page I have access. Thanks for the references. We should be all good then! Thanks again for the help. [14:46:48] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [14:46:51] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [14:46:52] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [14:46:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:56] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [14:46:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:59] !log herron@cumin1001 START - Cookbook sre.hosts.reboot-single for host thanos-be2003.codfw.wmnet [14:47:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:34] 10SRE, 10serviceops: Provide node14 images for running production node-based services - https://phabricator.wikimedia.org/T306996 (10Jdforrester-WMF) [14:49:44] (03PS1) 10JMeybohm: trafficserver: change miscweb backend back to miscweb.discovery.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/786977 (https://phabricator.wikimedia.org/T290966) [14:50:46] 10SRE, 10serviceops: Migrate node-based services in production to node14 - https://phabricator.wikimedia.org/T306995 (10Jdforrester-WMF) [14:53:46] !log razzi@cumin1001 conftool action : set/pooled=yes; selector: service=cloudceph,name=cloudcephmon1003.eqiad.wmnet [14:53:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:02] !log razzi@cumin1001 conftool action : set/pooled=no; selector: service=cloudceph,name=cloudcephmon1003.eqiad.wmnet [14:54:07] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:19] !log razzi@cumin1001 conftool action : set/pooled=inactive; selector: service=cloudceph,name=cloudcephmon1003.eqiad.wmnet [14:54:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:34] (03PS1) 10Muehlenhoff: Failover idp to idp1001 [dns] - 10https://gerrit.wikimedia.org/r/786981 [14:56:39] (03CR) 10Jaime Nuche: P:scap::dsh: Add scap targets as a dsh group (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/771441 (https://phabricator.wikimedia.org/T303559) (owner: 10Jbond) [14:57:55] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-be2003.codfw.wmnet [14:58:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:17] 10SRE, 10ops-codfw, 10DC-Ops, 10cloud-services-team (Kanban): Neutron networking not working for cloudnet200[5,6]-dev.codfw.wmnet - https://phabricator.wikimedia.org/T306861 (10Andrew) 05Resolved→03Open I spoke too soon! I now see that VMs can't talk to cloudserviceshosts: ` root@tools-codfw1dev-k8s... [14:58:23] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): Q3:(Need By: TBD) rack/setup/install 7 wmcs hosts - https://phabricator.wikimedia.org/T304881 (10Andrew) [14:58:40] (03PS1) 10Andrew Bogott: Revert "Prepare cloudnet2002-dev and cloudnet2004-dev for decom" [puppet] - 10https://gerrit.wikimedia.org/r/786415 [14:59:06] (03CR) 10Vgutierrez: [C: 03+1] "https://miscweb.discovery.wmnet:30443 looks good from a TLS point of view" [puppet] - 10https://gerrit.wikimedia.org/r/786977 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm) [14:59:38] vgutierrez: thanks! 
[15:00:03] (03CR) 10Andrew Bogott: [C: 03+2] Revert "Prepare cloudnet2002-dev and cloudnet2004-dev for decom" [puppet] - 10https://gerrit.wikimedia.org/r/786415 (owner: 10Andrew Bogott) [15:01:49] (03CR) 10Jbond: [C: 03+1] Failover idp to idp1001 [dns] - 10https://gerrit.wikimedia.org/r/786981 (owner: 10Muehlenhoff) [15:01:55] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [15:02:15] 10SRE, 10Infrastructure-Foundations, 10LDAP-Access-Requests: Grant Access to ldap/wmf for Nathillard - https://phabricator.wikimedia.org/T305978 (10jcrespo) 05Open→03Resolved a:03jcrespo [15:02:28] !log installing mariadb-10.5 updates (as packaged in Debian Bullseye, unrelated to wmf-mariadb packages) [15:02:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:33] (03CR) 10JMeybohm: [C: 03+2] trafficserver: change miscweb backend back to miscweb.discovery.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/786977 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm) [15:10:03] (03PS1) 10Kevin Bazira: ml-services: fix wikidatawiki & zhwiki editquality isvcs [deployment-charts] - 10https://gerrit.wikimedia.org/r/786982 (https://phabricator.wikimedia.org/T301415) [15:10:06] PROBLEM - MariaDB Replica Lag: x1 on dbstore1005 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 888.28 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:10:56] ^ looking [15:10:57] (03PS1) 10Hashar: gerrit: keep computing changes mergeability [puppet] - 10https://gerrit.wikimedia.org/r/786984 (https://phabricator.wikimedia.org/T303970) [15:11:23] (03CR) 10Raymond Ndibe: Create REST api service to manage toolforge replica.my.cnf (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/777037 
(https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [15:14:00] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.3 point update - https://phabricator.wikimedia.org/T304599 (10MoritzMuehlenhoff) [15:17:28] (03PS1) 10Jbond: reposync: improve push error handeling [software/spicerack] - 10https://gerrit.wikimedia.org/r/786989 [15:17:28] RECOVERY - MariaDB Replica Lag: x1 on dbstore1005 is OK: OK slave_sql_lag Replication lag: 0.09 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:17:40] (03CR) 10David Caro: [C: 03+2] sallogger: send to #wikimedia-cloud-feed instead [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/773309 (owner: 10RhinosF1) [15:18:09] 10SRE, 10serviceops: Migrate node-based services in production to node14 - https://phabricator.wikimedia.org/T306995 (10Jdforrester-WMF) [15:21:22] (03Merged) 10jenkins-bot: sallogger: send to #wikimedia-cloud-feed instead [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/773309 (owner: 10RhinosF1) [15:21:24] (03CR) 10Klausman: [C: 03+1] ml-services: fix wikidatawiki & zhwiki editquality isvcs [deployment-charts] - 10https://gerrit.wikimedia.org/r/786982 (https://phabricator.wikimedia.org/T301415) (owner: 10Kevin Bazira) [15:22:45] (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [15:25:57] (03CR) 10Giuseppe Lavagetto: [C: 03+2] varnish: switch to using new-style request filters [puppet] - 10https://gerrit.wikimedia.org/r/778544 (https://phabricator.wikimedia.org/T305606) (owner: 10Giuseppe Lavagetto) [15:27:32] (03CR) 10Klausman: [C: 03+2] ml-services: fix wikidatawiki & zhwiki editquality isvcs [deployment-charts] - 10https://gerrit.wikimedia.org/r/786982 (https://phabricator.wikimedia.org/T301415) (owner: 10Kevin Bazira) [15:29:45] !log ladsgroup@cumin1001 
dbctl commit (dc=all): 'Repooling after maintenance db1138 (T306560)', diff saved to https://phabricator.wikimedia.org/P26722 and previous config saved to /var/cache/conftool/dbconfig/20220427-152944-ladsgroup.json [15:29:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:51] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [15:31:25] (03CR) 10Volans: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/786989 (owner: 10Jbond) [15:31:54] !log kevinbazira@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [15:31:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:25] (03CR) 10Jelto: icinga: increase retries and delay for icinga status check (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/786949 (owner: 10Jelto) [15:33:10] !log kevinbazira@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [15:33:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:23] !log kevinbazira@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [15:35:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:16] !log kevinbazira@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
[15:36:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:07] (03CR) 10Volans: "reply inline" [software/spicerack] - 10https://gerrit.wikimedia.org/r/786949 (owner: 10Jelto) [15:37:56] (03CR) 10David Caro: Create REST api service to manage toolforge replica.my.cnf (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [15:39:43] !log herron@cumin1001 START - Cookbook sre.hosts.reboot-single for host thanos-be2004.codfw.wmnet [15:39:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:17] (03CR) 10Jbond: [C: 03+2] reposync: improve push error handeling [software/spicerack] - 10https://gerrit.wikimedia.org/r/786989 (owner: 10Jbond) [15:41:32] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2001 is CRITICAL: instance=10.192.32.33 verb={CREATE,PATCH} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [15:42:47] (03PS3) 10Sbisson: Add Wikistories extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773212 (https://phabricator.wikimedia.org/T303004) [15:43:12] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [15:44:18] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.3 point update - https://phabricator.wikimedia.org/T304599 (10MoritzMuehlenhoff) [15:44:22] (03PS3) 10Sbisson: Enable Wikistories on enwiki beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773334 (https://phabricator.wikimedia.org/T303004) [15:44:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138', diff saved to https://phabricator.wikimedia.org/P26723 and previous config saved to /var/cache/conftool/dbconfig/20220427-154449-ladsgroup.json [15:44:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:17] !log nokafor@deploy1002 Started deploy [airflow-dags/analytics@6684963]: (no justification provided) [15:45:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:25] !log nokafor@deploy1002 Finished deploy [airflow-dags/analytics@6684963]: (no justification provided) (duration: 00m 08s) [15:45:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:45] 10SRE, 10serviceops: Provide node14 images for running production node-based services - https://phabricator.wikimedia.org/T306996 (10bd808) From https://nodejs.org/en/about/releases/: |Release |Status |Initial Release |Active LTS Start |Maintenance LTS Start |End-of-life | | --- | --- | --- | --- | --- | --- |... 
[15:45:48] !log dcaro@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudbackup1001-dev.eqiad.wmnet [15:45:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:11] (03PS1) 10Andrew Bogott: Revert "Update recursor IPs for codfw1dev" [puppet] - 10https://gerrit.wikimedia.org/r/786416 [15:47:36] (03Merged) 10jenkins-bot: reposync: improve push error handeling [software/spicerack] - 10https://gerrit.wikimedia.org/r/786989 (owner: 10Jbond) [15:47:57] !log dcaro@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudbackup1001-dev.eqiad.wmnet [15:48:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:20] !log dcaro@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudbackup1002-dev.eqiad.wmnet [15:48:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:35] (03CR) 10Hashar: [C: 03+2] [WMF] register our plugins as submodules [software/gerrit] (wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/756109 (owner: 10Hashar) [15:49:48] (03CR) 10Andrew Bogott: [C: 03+2] Revert "Update recursor IPs for codfw1dev" [puppet] - 10https://gerrit.wikimedia.org/r/786416 (owner: 10Andrew Bogott) [15:50:18] !log dcaro@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudbackup1002-dev.eqiad.wmnet [15:50:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:31] !log dcaro@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudbackup1003.eqiad.wmnet [15:50:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:45] (03CR) 10JMeybohm: [C: 03+2] Fix permissions/ownership of helm directories [puppet] - 10https://gerrit.wikimedia.org/r/786269 (https://phabricator.wikimedia.org/T305729) (owner: 10JMeybohm) [15:50:47] (03CR) 10JMeybohm: [V: 03+1 C: 03+2] Clean up helm2 specific code and environment variable [puppet] - 
10https://gerrit.wikimedia.org/r/786270 (owner: 10JMeybohm) [15:50:54] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [15:52:23] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-be2004.codfw.wmnet [15:52:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:10] !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/miscweb: apply [15:56:13] !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [15:56:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:58] (03Merged) 10jenkins-bot: [WMF] register our plugins as submodules [software/gerrit] (wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/756109 (owner: 10Hashar) [15:58:10] !log dcaro@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudbackup1003.eqiad.wmnet [15:58:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:29] (03PS1) 10JMeybohm: Revert "Revert "Revert "mwdebug_deploy: switch back to using the root user""" [puppet] - 10https://gerrit.wikimedia.org/r/786418 [15:58:41] (03PS2) 10JMeybohm: Revert "Revert "Revert "mwdebug_deploy: switch back to using the root user""" [puppet] - 10https://gerrit.wikimedia.org/r/786418 [15:59:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138', diff saved to https://phabricator.wikimedia.org/P26724 and previous config saved to /var/cache/conftool/dbconfig/20220427-155954-ladsgroup.json [15:59:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:52] (03PS1) 10Hashar: 
Gerrit v3.4.3 and rebuild plugins [software/gerrit] (deploy/wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/787000 (https://phabricator.wikimedia.org/T292759) [16:01:11] (03CR) 10jerkins-bot: [V: 04-1] Gerrit v3.4.3 and rebuild plugins [software/gerrit] (deploy/wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/787000 (https://phabricator.wikimedia.org/T292759) (owner: 10Hashar) [16:04:13] !log foreachwikiindblist s6 mysql.php -- -e "desc querycache_info;" | grep -i qci_timestamp | grep -i varbinary | awk '{ print substr($1, 1, length($1)-1) }' | xargs -I {} sql {} --write -- -e 'ALTER TABLE /*_*/querycache_info CHANGE qci_timestamp qci_timestamp BINARY(14) DEFAULT '19700101000000' NOT NULL;' (T298559) [16:04:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:19] T298559: Fix mismatching field type of querycache_info.qci_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298559 [16:05:29] (03CR) 10JMeybohm: [C: 03+1] add a namespace for new service image-suggestions (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/775964 (https://phabricator.wikimedia.org/T304891) (owner: 10Dzahn) [16:05:55] (03PS3) 10Giuseppe Lavagetto: varnish: remove old-style request filters [puppet] - 10https://gerrit.wikimedia.org/r/778545 (https://phabricator.wikimedia.org/T305606) [16:06:29] (03CR) 10jerkins-bot: [V: 04-1] varnish: remove old-style request filters [puppet] - 10https://gerrit.wikimedia.org/r/778545 (https://phabricator.wikimedia.org/T305606) (owner: 10Giuseppe Lavagetto) [16:06:46] (03CR) 10Hashar: "recheck after ./deploy_artifacts.py --version=3.4.3 gerrit.war plugins/*" [software/gerrit] (deploy/wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/787000 (https://phabricator.wikimedia.org/T292759) (owner: 10Hashar) [16:07:08] (03PS1) 10Majavah: P:openstack::rabbitmq: cleanup [puppet] - 10https://gerrit.wikimedia.org/r/787003 [16:08:13] (03PS4) 10Giuseppe Lavagetto: varnish: remove old-style request filters 
[puppet] - 10https://gerrit.wikimedia.org/r/778545 (https://phabricator.wikimedia.org/T305606) [16:08:29] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34968/console" [puppet] - 10https://gerrit.wikimedia.org/r/787003 (owner: 10Majavah) [16:10:38] (03CR) 10Hashar: "recheck" [software/gerrit] (deploy/wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/787000 (https://phabricator.wikimedia.org/T292759) (owner: 10Hashar) [16:12:05] (03PS1) 10Elukey: kserve-inference: add the wikidata use case to revscoring_services [deployment-charts] - 10https://gerrit.wikimedia.org/r/787004 [16:12:47] (03CR) 10Paladox: [C: 03+1] gerrit: keep computing changes mergeability [puppet] - 10https://gerrit.wikimedia.org/r/786984 (https://phabricator.wikimedia.org/T303970) (owner: 10Hashar) [16:13:17] 10SRE, 10ops-codfw, 10DC-Ops, 10cloud-services-team (Kanban): Neutron networking not working for cloudnet200[5,6]-dev.codfw.wmnet - https://phabricator.wikimedia.org/T306861 (10Andrew) ... 
I'm now less sure about what's happening here, so stand by :) [16:13:44] (03PS5) 10Giuseppe Lavagetto: varnish: remove old-style request filters [puppet] - 10https://gerrit.wikimedia.org/r/778545 (https://phabricator.wikimedia.org/T305606) [16:14:12] (03PS2) 10Elukey: kserve-inference: add the wikidata use case to revscoring_services [deployment-charts] - 10https://gerrit.wikimedia.org/r/787004 [16:14:44] (03PS1) 10Andrew Bogott: profile::openstack::codfw1dev::networktests: update recursor IP for codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/787005 [16:15:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138 (T306560)', diff saved to https://phabricator.wikimedia.org/P26725 and previous config saved to /var/cache/conftool/dbconfig/20220427-161459-ladsgroup.json [16:15:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:06] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [16:15:21] (03CR) 10Umherirrender: [C: 03+1] Use more compact PHP7 syntax where possible (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737859 (owner: 10Thiemo Kreuz (WMDE)) [16:17:20] (03CR) 10Andrew Bogott: [C: 03+2] profile::openstack::codfw1dev::networktests: update recursor IP for codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/787005 (owner: 10Andrew Bogott) [16:20:14] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34970/console" [puppet] - 10https://gerrit.wikimedia.org/r/778545 (https://phabricator.wikimedia.org/T305606) (owner: 10Giuseppe Lavagetto) [16:23:57] (03CR) 10Giuseppe Lavagetto: [V: 03+1 C: 03+2] varnish: remove old-style request filters [puppet] - 10https://gerrit.wikimedia.org/r/778545 (https://phabricator.wikimedia.org/T305606) (owner: 10Giuseppe Lavagetto) [16:25:16] (03CR) 10Elukey: [C: 03+2] kserve-inference: add the wikidata 
use case to revscoring_services [deployment-charts] - 10https://gerrit.wikimedia.org/r/787004 (owner: 10Elukey) [16:28:24] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [16:28:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:04] Urbanecm and stephanebisson: That opportune time is upon us again. Time for a Deploy Wikistories to beta cluster deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220427T1630). [16:30:51] Hey urbanecm o/ [16:31:27] (03PS1) 10Andrew Bogott: Prepare cloudnet2002-dev and cloudnet2004-dev for decom [puppet] - 10https://gerrit.wikimedia.org/r/787012 (https://phabricator.wikimedia.org/T306989) [16:32:40] (03CR) 10Andrew Bogott: [C: 03+2] Prepare cloudnet2002-dev and cloudnet2004-dev for decom [puppet] - 10https://gerrit.wikimedia.org/r/787012 (https://phabricator.wikimedia.org/T306989) (owner: 10Andrew Bogott) [16:32:57] (03CR) 10Hashar: "I have tested it with Gerrit 3.4.3 from https://gerrit.wikimedia.org/r/c/operations/software/gerrit/+/787000/ and confirmed is:mergeable " [puppet] - 10https://gerrit.wikimedia.org/r/786984 (https://phabricator.wikimedia.org/T303970) (owner: 10Hashar) [16:34:50] !log razzi@cumin1001 conftool action : set/pooled=yes; selector: service=cloudceph,name=cloudcephmon1003.eqiad.wmnet [16:34:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:12] !log razzi@cumin1001 conftool action : set/pooled=inactive; selector: service=cloudceph,name=cloudcephmon1003.eqiad.wmnet [16:35:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:37] (03PS3) 10Razzi: dbproxy: add clouddb sections to conftool [puppet] - 10https://gerrit.wikimedia.org/r/779926 (https://phabricator.wikimedia.org/T304478) [16:36:05] razzi: umh, may I ask why you're touching ceph modes on conftool? 
those aren't related to wikireplicas [16:39:01] (03CR) 10JHathaway: [C: 03+2] gerrit: keep computing changes mergeability [puppet] - 10https://gerrit.wikimedia.org/r/786984 (https://phabricator.wikimedia.org/T303970) (owner: 10Hashar) [16:39:34] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): Q3:(Need By: TBD) rack/setup/install 7 wmcs hosts - https://phabricator.wikimedia.org/T304881 (10Andrew) [16:39:54] 10SRE, 10ops-codfw, 10DC-Ops, 10cloud-services-team (Kanban): Neutron networking not working for cloudnet200[5,6]-dev.codfw.wmnet - https://phabricator.wikimedia.org/T306861 (10Andrew) 05Open→03Resolved Whatever this is, it wasn't what I thought it was. Working now. [16:41:11] !log herron@cumin1001 START - Cookbook sre.hosts.reboot-single for host thanos-be1001.eqiad.wmnet [16:41:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:42] PROBLEM - Confd template for /etc/varnish/dynamic.actions.inc.vcl on cp1081 is CRITICAL: File not found: /etc/varnish/dynamic.actions.inc.vcl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [16:43:02] !log dancy@deploy1002 Started deploy [releng/phatality@d8e2adc]: (no justification provided) [16:43:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:16] !log dancy@deploy1002 Finished deploy [releng/phatality@d8e2adc]: (no justification provided) (duration: 00m 13s) [16:43:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:31] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
[16:46:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:04] !log andrew@cumin1001 START - Cookbook sre.hosts.decommission for hosts cloudnet[2002,2004]-dev.codfw.wmnet [16:52:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:36] !log andrew@cumin1001 END (ERROR) - Cookbook sre.hosts.decommission (exit_code=97) for hosts cloudnet[2002,2004]-dev.codfw.wmnet [16:52:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:58] (03PS1) 10Brennen Bearnes: Fix warnings relating to QuickTemplate [skins/WikimediaApiPortal] (wmf/1.39.0-wmf.9) - 10https://gerrit.wikimedia.org/r/786420 (https://phabricator.wikimedia.org/T306925) [16:53:58] !log andrew@cumin1001 START - Cookbook sre.hosts.decommission for hosts cloudnet[2002,2004]-dev.codfw.wmnet [16:54:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:33] (03CR) 10Brennen Bearnes: "Readying a cherry pick here in case this spikes log noise dramatically." 
[skins/WikimediaApiPortal] (wmf/1.39.0-wmf.9) - 10https://gerrit.wikimedia.org/r/786420 (https://phabricator.wikimedia.org/T306925) (owner: 10Brennen Bearnes) [16:56:21] !log herron@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host thanos-be1001.eqiad.wmnet [16:56:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:53] !log andrew@cumin1001 START - Cookbook sre.dns.netbox [16:58:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:59:06] (03PS1) 10Andrew Bogott: Remove references to cloudnet200[2,4].codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/787022 [16:59:51] (03CR) 10jerkins-bot: [V: 04-1] Remove references to cloudnet200[2,4].codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/787022 (owner: 10Andrew Bogott) [17:01:03] (03PS2) 10Andrew Bogott: Remove references to cloudnet200[2,4].codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/787022 (https://phabricator.wikimedia.org/T306989) [17:01:46] !log andrew@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:01:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:10:16] (03CR) 10Andrew Bogott: [C: 03+2] Remove references to cloudnet200[2,4].codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/787022 (https://phabricator.wikimedia.org/T306989) (owner: 10Andrew Bogott) [17:10:23] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudnet[2002,2004]-dev.codfw.wmnet [17:10:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:10:32] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Kanban): decommission cloudnet200[2,4]-dev.wikimedia.org - https://phabricator.wikimedia.org/T306989 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by andrew@cumin1001 for hosts: `cloudnet[2002,2004]-dev.codfw.wmne... 
[17:11:18] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Kanban): decommission cloudnet200[2,4]-dev.wikimedia.org - https://phabricator.wikimedia.org/T306989 (10Andrew) [17:11:20] RECOVERY - DNS on thumbor2005.mgmt is OK: DNS OK: 0.009 seconds response time. thumbor2005.mgmt.codfw.wmnet returns 10.193.0.182 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:16:06] (03CR) 10Urbanecm: [C: 03+2] Add Wikistories extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773212 (https://phabricator.wikimedia.org/T303004) (owner: 10Sbisson) [17:16:51] (03Merged) 10jenkins-bot: Add Wikistories extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773212 (https://phabricator.wikimedia.org/T303004) (owner: 10Sbisson) [17:20:12] urbanecm: Please ping me when you're done. [17:20:38] dancy: sure thing [17:21:37] (03CR) 10Urbanecm: [C: 03+2] Enable Wikistories on enwiki beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773334 (https://phabricator.wikimedia.org/T303004) (owner: 10Sbisson) [17:22:05] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1113.eqiad.wmnet with reason: Maintenance [17:22:07] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1113.eqiad.wmnet with reason: Maintenance [17:22:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3315 (T298563)', diff saved to https://phabricator.wikimedia.org/P26727 and previous config saved to /var/cache/conftool/dbconfig/20220427-172211-ladsgroup.json [17:22:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:21] T298563: Fix mismatching field type of column text.old_flags on wmf wikis - https://phabricator.wikimedia.org/T298563 
[17:22:23] (03Merged) 10jenkins-bot: Enable Wikistories on enwiki beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773334 (https://phabricator.wikimedia.org/T303004) (owner: 10Sbisson) [17:23:33] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [17:23:36] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [17:23:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:37] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [17:23:41] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [17:23:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:46] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1180.eqiad.wmnet with reason: Maintenance [17:26:48] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1180.eqiad.wmnet with reason: Maintenance [17:26:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1180 (T298558)', diff saved to https://phabricator.wikimedia.org/P26728 and previous config saved to /var/cache/conftool/dbconfig/20220427-172653-ladsgroup.json [17:26:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:27:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:27:01] T298558: Fix mismatching field type of protected_titles.pt_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298558 [17:27:16] !log urbanecm@deploy1002 Synchronized wmf-config/extension-list: 01dfaf063d14ee329c43d65566270ff3cec48d50: Add Wikistories extension (T303004) 
(duration: 00m 49s) [17:27:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:27:21] T303004: Deploy Wikistories extension to beta cluster - https://phabricator.wikimedia.org/T303004 [17:28:08] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 88ac6b94ff5acfbcd6a539e21b7454b6a38eaaba: Enable Wikistories on enwiki beta (T303004; 1/2) (duration: 00m 51s) [17:28:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:28:34] 10SRE, 10ops-esams, 10DC-Ops: ripe-atlas-esams down - https://phabricator.wikimedia.org/T303242 (10RobH) 05Open→03Resolved This is out of warranty so I've created T307021 to determine if we replace it in esams, drmrs, and with a virtual or actual anchor. Since this is now defunct, I'll create and link i... [17:28:43] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [17:28:46] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [17:28:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:28:47] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [17:28:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:28:51] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [17:28:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:28:58] !log urbanecm@deploy1002 Synchronized wmf-config/CommonSettings.php: 88ac6b94ff5acfbcd6a539e21b7454b6a38eaaba: Enable Wikistories on enwiki beta (T303004; 2/2) (duration: 00m 50s) [17:28:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T298558)', diff saved to https://phabricator.wikimedia.org/P26729 and previous config saved to /var/cache/conftool/dbconfig/20220427-172900-ladsgroup.json 
[17:29:04] !log herron@cumin1001 START - Cookbook sre.hosts.reboot-single for host thanos-be1002.eqiad.wmnet [17:29:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:32:57] 10SRE, 10ops-esams, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: decommission atlas-esams - https://phabricator.wikimedia.org/T307026 (10RobH) [17:33:10] 10SRE, 10ops-esams, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: decommission atlas-esams - https://phabricator.wikimedia.org/T307026 (10RobH) [17:33:16] 10SRE, 10ops-esams, 10DC-Ops: ripe-atlas-esams down - https://phabricator.wikimedia.org/T303242 (10RobH) [17:34:16] 10SRE, 10ops-esams, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: decommission atlas-esams - https://phabricator.wikimedia.org/T307026 (10RobH) @ayounsi & @cmooney: Would one of you take care of disabling this atlas anchor on our RIPE account and if needed, rotating any private keys or creds that m... 
[17:34:53] dancy: you should be fine to go ahead now [17:34:58] thx [17:35:12] 10SRE, 10ops-esams, 10DC-Ops: ripe-atlas-esams down - https://phabricator.wikimedia.org/T303242 (10RobH) [17:35:26] 10SRE, 10ops-esams, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: decommission atlas-esams - https://phabricator.wikimedia.org/T307026 (10RobH) [17:35:42] !log dancy@deploy1002 Started scap: (no justification provided) [17:35:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:03] !log dancy@deploy1002 Testing image build and deployment [17:36:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:37:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T298563)', diff saved to https://phabricator.wikimedia.org/P26730 and previous config saved to /var/cache/conftool/dbconfig/20220427-173709-ladsgroup.json [17:37:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:37:15] T298563: Fix mismatching field type of column text.old_flags on wmf wikis - https://phabricator.wikimedia.org/T298563 [17:39:26] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-be1002.eqiad.wmnet [17:39:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:20] (03CR) 10Razzi: [C: 03+2] wikireplicas: remove wb_changes_dispatch view for dropped table [puppet] - 10https://gerrit.wikimedia.org/r/773614 (https://phabricator.wikimedia.org/T304591) (owner: 10Razzi) [17:44:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P26731 and previous config saved to /var/cache/conftool/dbconfig/20220427-174405-ladsgroup.json [17:44:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:45:11] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [17:45:15] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:49:07] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [17:49:10] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [17:49:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:49:11] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [17:49:15] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [17:49:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:49:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:49:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:50:54] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [17:50:55] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [17:50:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:50:58] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:51:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3317 (T306560)', diff saved to https://phabricator.wikimedia.org/P26732 and previous config saved to /var/cache/conftool/dbconfig/20220427-175100-ladsgroup.json [17:51:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:13] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [17:52:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T306560)', diff saved to 
https://phabricator.wikimedia.org/P26733 and previous config saved to /var/cache/conftool/dbconfig/20220427-175212-ladsgroup.json [17:52:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P26734 and previous config saved to /var/cache/conftool/dbconfig/20220427-175220-ladsgroup.json [17:52:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:20] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [17:54:23] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [17:54:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:24] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [17:54:28] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [17:54:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P26735 and previous config saved to /var/cache/conftool/dbconfig/20220427-175910-ladsgroup.json [17:59:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:05] brennen and jeena: Your horoscope predicts another unfortunate Train log triage with CPT deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220427T1800). [18:00:05] brennen and jeena: Dear deployers, time to do the MediaWiki train - Utc-7 Version deploy. Dont look at me like that. You signed up for it. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220427T1800). [18:00:08] o/ [18:00:11] !log train 1.39.0-wmf.9 (T305215): no current blockers, proceeding to group1 [18:00:12] o/ [18:00:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:17] (03CR) 10Jdlrobson: [C: 03+1] Fix warnings relating to QuickTemplate [skins/WikimediaApiPortal] (wmf/1.39.0-wmf.9) - 10https://gerrit.wikimedia.org/r/786420 (https://phabricator.wikimedia.org/T306925) (owner: 10Brennen Bearnes) [18:00:18] T305215: 1.39.0-wmf.9 deployment blockers - https://phabricator.wikimedia.org/T305215 [18:00:40] brennen: I have a scap sync-world in progress to build a container. It should be done in a few minutes [18:00:54] dancy: ack, thanks for heads up. [18:01:06] (shoulda looked closer at backscroll.) [18:02:18] while i'm thinking about it, i'll go ahead with that QuickTemplate backport once that's done and before doing the train. [18:02:28] may as well get rid of that noise. (cc: Jdlrobson) [18:02:42] brennen: I'm done. 
All yours [18:02:47] thx [18:03:04] (03CR) 10Brennen Bearnes: [C: 03+2] Fix warnings relating to QuickTemplate [skins/WikimediaApiPortal] (wmf/1.39.0-wmf.9) - 10https://gerrit.wikimedia.org/r/786420 (https://phabricator.wikimedia.org/T306925) (owner: 10Brennen Bearnes) [18:03:12] That'll be nice [18:04:36] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [18:04:39] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [18:04:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:40] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [18:04:44] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [18:04:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:15] (03Merged) 10jenkins-bot: Fix warnings relating to QuickTemplate [skins/WikimediaApiPortal] (wmf/1.39.0-wmf.9) - 10https://gerrit.wikimedia.org/r/786420 (https://phabricator.wikimedia.org/T306925) (owner: 10Brennen Bearnes) [18:07:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P26736 and previous config saved to /var/cache/conftool/dbconfig/20220427-180717-ladsgroup.json [18:07:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P26737 and previous config saved to /var/cache/conftool/dbconfig/20220427-180725-ladsgroup.json [18:07:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:12] (03PS1) 10Cwhite: patch 1.11.1: replace 
elasticsearch-py dep with opensearch-py [debs/prometheus-es-exporter] (debian/sid) - 10https://gerrit.wikimedia.org/r/787046 (https://phabricator.wikimedia.org/T301017) [18:09:47] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [18:09:50] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [18:09:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:51] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [18:09:55] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [18:09:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:42] !log brennen@deploy1002 Synchronized php-1.39.0-wmf.9/skins/WikimediaApiPortal: Backport: [[gerrit:786420|Fix warnings relating to QuickTemplate (T306925)]] (duration: 02m 26s) [18:11:46] (03CR) 10Cwhite: [V: 04-1] "Requires python3-opensearch deb package to be built and uploaded. Request is out for a repo." 
[debs/prometheus-es-exporter] (debian/sid) - 10https://gerrit.wikimedia.org/r/787046 (https://phabricator.wikimedia.org/T301017) (owner: 10Cwhite) [18:11:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:48] T306925: Deprecations: QuickTemplate::(get/html/text/haveData) with parameters bottomscripts and headelement - https://phabricator.wikimedia.org/T306925 [18:12:14] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host gitlab2002.mgmt.codfw.wmnet with reboot policy FORCED [18:12:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:20] (03PS1) 10Brennen Bearnes: group1 wikis to 1.39.0-wmf.9 refs T305215 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/787047 [18:13:22] (03CR) 10Brennen Bearnes: [C: 03+2] group1 wikis to 1.39.0-wmf.9 refs T305215 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/787047 (owner: 10Brennen Bearnes) [18:14:03] !log herron@cumin1001 START - Cookbook sre.hosts.reboot-single for host thanos-be1003.eqiad.wmnet [18:14:06] (03Merged) 10jenkins-bot: group1 wikis to 1.39.0-wmf.9 refs T305215 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/787047 (owner: 10Brennen Bearnes) [18:14:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T298558)', diff saved to https://phabricator.wikimedia.org/P26738 and previous config saved to /var/cache/conftool/dbconfig/20220427-181415-ladsgroup.json [18:14:17] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1168.eqiad.wmnet with reason: Maintenance [18:14:18] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1168.eqiad.wmnet with reason: Maintenance [18:14:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:22] T298558: Fix mismatching field type of protected_titles.pt_timestamp on 
wmf wikis - https://phabricator.wikimedia.org/T298558 [18:14:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1168 (T298558)', diff saved to https://phabricator.wikimedia.org/P26739 and previous config saved to /var/cache/conftool/dbconfig/20220427-181423-ladsgroup.json [18:14:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:41] so far so chill. [18:20:00] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [18:20:03] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [18:20:04] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [18:20:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:08] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [18:20:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:22:20] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1122.eqiad.wmnet with reason: Maintenance [18:22:21] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1122.eqiad.wmnet with reason: Maintenance [18:22:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P26740 and previous config saved to /var/cache/conftool/dbconfig/20220427-182222-ladsgroup.json [18:22:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1122 (T298565)', diff saved to https://phabricator.wikimedia.org/P26741 and previous 
config saved to /var/cache/conftool/dbconfig/20220427-182226-ladsgroup.json [18:22:26] (03PS1) 10Ottomata: Release version 2.1.4-py3.7-4 [debs/airflow] (debian) - 10https://gerrit.wikimedia.org/r/787050 [18:22:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T298563)', diff saved to https://phabricator.wikimedia.org/P26742 and previous config saved to /var/cache/conftool/dbconfig/20220427-182230-ladsgroup.json [18:22:32] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1144.eqiad.wmnet with reason: Maintenance [18:22:33] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1144.eqiad.wmnet with reason: Maintenance [18:22:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3315 (T298563)', diff saved to https://phabricator.wikimedia.org/P26743 and previous config saved to /var/cache/conftool/dbconfig/20220427-182238-ladsgroup.json [18:22:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:22:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:22:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:22:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:22:59] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [18:23:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:05] T298563: Fix mismatching field type of column text.old_flags on wmf wikis - https://phabricator.wikimedia.org/T298563 [18:23:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:13] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:35] !log brennen@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.39.0-wmf.9 refs T305215 [18:24:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:42] T305215: 1.39.0-wmf.9 deployment blockers - https://phabricator.wikimedia.org/T305215 [18:25:33] !log brennen@deploy1002 Synchronized php: group1 wikis to 1.39.0-wmf.9 refs T305215 (duration: 00m 56s) [18:25:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:26:07] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-be1003.eqiad.wmnet [18:26:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:26:47] brennen: OK for me to resume container build/deploy testing? [18:27:33] (03PS2) 10Ottomata: Release version 2.1.4-py3.7-4 [debs/airflow] (debian) - 10https://gerrit.wikimedia.org/r/787050 [18:28:33] dancy: looking pretty quiet, i'd say go ahead. 
[18:32:45] (03PS5) 10Dzahn: add a namespace for new service image-suggestion [deployment-charts] - 10https://gerrit.wikimedia.org/r/775964 (https://phabricator.wikimedia.org/T304891) [18:33:28] (03CR) 10Dzahn: add a namespace for new service image-suggestion (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/775964 (https://phabricator.wikimedia.org/T304891) (owner: 10Dzahn) [18:35:01] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host gitlab2002.mgmt.codfw.wmnet with reboot policy FORCED [18:35:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:36:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T298563)', diff saved to https://phabricator.wikimedia.org/P26744 and previous config saved to /var/cache/conftool/dbconfig/20220427-183645-ladsgroup.json [18:36:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:36:52] T298563: Fix mismatching field type of column text.old_flags on wmf wikis - https://phabricator.wikimedia.org/T298563 [18:37:07] brennen: thx. 
Resuming [18:37:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T306560)', diff saved to https://phabricator.wikimedia.org/P26745 and previous config saved to /var/cache/conftool/dbconfig/20220427-183727-ladsgroup.json [18:37:29] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1127.eqiad.wmnet with reason: Maintenance [18:37:30] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1127.eqiad.wmnet with reason: Maintenance [18:37:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:33] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [18:37:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1127 (T306560)', diff saved to https://phabricator.wikimedia.org/P26746 and previous config saved to /var/cache/conftool/dbconfig/20220427-183735-ladsgroup.json [18:37:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:55] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [18:38:15] !log dancy@deploy1002 Started scap: (no justification provided) [18:38:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:44] 10SRE, 10ops-codfw, 10DC-Ops, 10GitLab (Infrastructure): Q3:(Need By: TBD) rack/setup/install gitlab200[2|3] and gitlab-runner200[2|3|4] - https://phabricator.wikimedia.org/T301183 (10Dzahn) cc: @Jelto @Papaul has asked which partman recipe to use. 
Since these are the first physical servers that might not... [18:40:16] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [18:40:19] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [18:40:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:40:20] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [18:40:24] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [18:40:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:40:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:40:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:31] !log dancy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [18:45:32] 10SRE, 10ops-codfw, 10DC-Ops, 10GitLab (Infrastructure): Q3:(Need By: TBD) rack/setup/install gitlab200[2|3] and gitlab-runner200[2|3|4] - https://phabricator.wikimedia.org/T301183 (10Dzahn) The current install_server config is: ` gitlab*) echo partman/flat.cfg virtual.cfg ;; \ ` so we can't ke... 
[18:45:34] !log dancy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [18:45:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:42] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host gitlab2003.mgmt.codfw.wmnet with reboot policy FORCED [18:45:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:29] !log otto@deploy1002 Started deploy [airflow-dags/analytics_test@6684963]: (no justification provided) [18:46:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:41] !log otto@deploy1002 Finished deploy [airflow-dags/analytics_test@6684963]: (no justification provided) (duration: 00m 12s) [18:46:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122 (T298565)', diff saved to https://phabricator.wikimedia.org/P26748 and previous config saved to /var/cache/conftool/dbconfig/20220427-184646-ladsgroup.json [18:46:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:52] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [18:50:23] !log dancy@deploy1002 Started scap: testing mediawiki container image build and deploy [18:50:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:31] !log herron@cumin1001 START - Cookbook sre.hosts.reboot-single for host thanos-be1004.eqiad.wmnet [18:51:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling 
after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P26749 and previous config saved to /var/cache/conftool/dbconfig/20220427-185150-ladsgroup.json [18:51:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:55:52] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:56:28] ^^ That is me [18:59:45] (03PS1) 10Dzahn: install_server/gitlab: separate partman recipes for physical servers [puppet] - 10https://gerrit.wikimedia.org/r/787051 (https://phabricator.wikimedia.org/T301183) [19:00:26] !log dancy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [19:00:29] !log dancy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [19:00:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:01:27] thanks dancy, ACK [19:01:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122', diff saved to https://phabricator.wikimedia.org/P26750 and previous config saved to /var/cache/conftool/dbconfig/20220427-190151-ladsgroup.json [19:01:55] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [19:01:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:02:16] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-be1004.eqiad.wmnet [19:02:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:02:34] (03PS3) 10Ottomata: Release version 2.1.4-py3.7-4 [debs/airflow] (debian) - 
https://gerrit.wikimedia.org/r/787050 [19:04:55] (PS4) Ottomata: Release version 2.1.4-py3.7-4 [debs/airflow] (debian) - https://gerrit.wikimedia.org/r/787050 [19:06:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P26751 and previous config saved to /var/cache/conftool/dbconfig/20220427-190655-ladsgroup.json [19:06:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:27] (CR) Dzahn: add image-suggestion.discovery.wmnet and point to ingress-wikikube (1 comment) [dns] - https://gerrit.wikimedia.org/r/786426 (https://phabricator.wikimedia.org/T304891) (owner: Dzahn) [19:08:53] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host gitlab2003.mgmt.codfw.wmnet with reboot policy FORCED [19:08:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:59] (CR) Hashar: [C: -1] "Since I prepared the Gerrit 3.4 upgrade a new release has been cut: 3.4.4 which has a security fix. So we should aim at that one instead o" [software/gerrit] (deploy/wmf/stable-3.4) - https://gerrit.wikimedia.org/r/787000 (https://phabricator.wikimedia.org/T292759) (owner: Hashar) [19:10:14] (CR) Dzahn: [C: -2] "meanwhile https://gerrit.wikimedia.org/r/c/operations/dns/+/786322 happened" [dns] - https://gerrit.wikimedia.org/r/786426 (https://phabricator.wikimedia.org/T304891) (owner: Dzahn) [19:14:22] (CR) Dzahn: "Well, in my original ticket I asked for a custom puppet fact for this thing. 
Then John pointed out it already works with this (apparently " [puppet] - https://gerrit.wikimedia.org/r/786430 (https://phabricator.wikimedia.org/T306830) (owner: Dzahn) [19:14:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T298558)', diff saved to https://phabricator.wikimedia.org/P26752 and previous config saved to /var/cache/conftool/dbconfig/20220427-191438-ladsgroup.json [19:14:42] (PS1) Ahmon Dancy: Update path to values file with image names [deployment-charts] - https://gerrit.wikimedia.org/r/787054 (https://phabricator.wikimedia.org/T299648) [19:14:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:45] T298558: Fix mismatching field type of protected_titles.pt_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298558 [19:16:10] (CR) Dzahn: "did this need/get a service restart?" [puppet] - https://gerrit.wikimedia.org/r/786984 (https://phabricator.wikimedia.org/T303970) (owner: Hashar) [19:16:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122', diff saved to https://phabricator.wikimedia.org/P26753 and previous config saved to /var/cache/conftool/dbconfig/20220427-191656-ladsgroup.json [19:17:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:57] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host gitlab-runner2002.mgmt.codfw.wmnet with reboot policy FORCED [19:19:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:38] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:20:41] (CR) Ahmon Dancy: [C: +2] Update path to values file with image names [deployment-charts] - https://gerrit.wikimedia.org/r/787054 (https://phabricator.wikimedia.org/T299648) (owner: Ahmon Dancy) [19:21:38] !log otto@deploy1002 Started 
deploy [airflow-dags/analytics_test@6684963]: (no justification provided) [19:21:42] !log otto@deploy1002 Finished deploy [airflow-dags/analytics_test@6684963]: (no justification provided) (duration: 00m 03s) [19:21:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:21:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T298563)', diff saved to https://phabricator.wikimedia.org/P26754 and previous config saved to /var/cache/conftool/dbconfig/20220427-192200-ladsgroup.json [19:22:02] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1110.eqiad.wmnet with reason: Maintenance [19:22:04] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1110.eqiad.wmnet with reason: Maintenance [19:22:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:06] T298563: Fix mismatching field type of column text.old_flags on wmf wikis - https://phabricator.wikimedia.org/T298563 [19:22:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1110 (T298563)', diff saved to https://phabricator.wikimedia.org/P26755 and previous config saved to /var/cache/conftool/dbconfig/20220427-192209-ladsgroup.json [19:22:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:31] (PS1) Hashar: Merge tag 'v3.4.4' into wmf/stable-3.4 [software/gerrit] (wmf/stable-3.4) - https://gerrit.wikimedia.org/r/787055 (https://phabricator.wikimedia.org/T292759) [19:22:45] (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - 
https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [19:22:55] SRE, ops-codfw, Cassandra, DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs2001-aqs2012 - https://phabricator.wikimedia.org/T305568 (Eevans) >>! In T305568#7881920, @Papaul wrote: > @Eevans I received those nodes today so I will be racking them tomorrow. Here is my racking proposal for tomorrow.... [19:23:43] !log otto@deploy1002 Started deploy [airflow-dags/analytics_test@6684963]: (no justification provided) [19:23:47] !log otto@deploy1002 Finished deploy [airflow-dags/analytics_test@6684963]: (no justification provided) (duration: 00m 03s) [19:23:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:05] !log otto@deploy1002 Started deploy [airflow-dags/analytics_test@6684963]: (no justification provided) [19:24:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:16] !log otto@deploy1002 Finished deploy [airflow-dags/analytics_test@6684963]: (no justification provided) (duration: 00m 11s) [19:24:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:26:15] (Merged) jenkins-bot: Update path to values file with image names [deployment-charts] - https://gerrit.wikimedia.org/r/787054 (https://phabricator.wikimedia.org/T299648) (owner: Ahmon Dancy) [19:27:04] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:28:37] !log otto@deploy1002 Started deploy [airflow-dags/analytics@6684963]: (no justification provided) [19:28:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:28:46] !log otto@deploy1002 Finished deploy [airflow-dags/analytics@6684963]: 
(no justification provided) (duration: 00m 09s) [19:28:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:29:22] (PS1) Ahmon Dancy: Fix use of .Release.Name [deployment-charts] - https://gerrit.wikimedia.org/r/787058 (https://phabricator.wikimedia.org/T299648) [19:29:37] (CR) jerkins-bot: [V: -1] Fix use of .Release.Name [deployment-charts] - https://gerrit.wikimedia.org/r/787058 (https://phabricator.wikimedia.org/T299648) (owner: Ahmon Dancy) [19:29:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P26756 and previous config saved to /var/cache/conftool/dbconfig/20220427-192943-ladsgroup.json [19:29:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:29:48] (PS2) Hashar: Merge tag 'v3.4.4' into wmf/stable-3.4 [software/gerrit] (wmf/stable-3.4) - https://gerrit.wikimedia.org/r/787055 (https://phabricator.wikimedia.org/T292759) [19:30:07] (CR) Ahmon Dancy: "recheck" [deployment-charts] - https://gerrit.wikimedia.org/r/787058 (https://phabricator.wikimedia.org/T299648) (owner: Ahmon Dancy) [19:32:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122 (T298565)', diff saved to https://phabricator.wikimedia.org/P26757 and previous config saved to /var/cache/conftool/dbconfig/20220427-193201-ladsgroup.json [19:32:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:32:08] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [19:32:27] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1122.eqiad.wmnet with reason: Maintenance [19:32:28] !log ladsgroup@cumin1001 END 
(PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1122.eqiad.wmnet with reason: Maintenance [19:32:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:32:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1122 (T298565)', diff saved to https://phabricator.wikimedia.org/P26758 and previous config saved to /var/cache/conftool/dbconfig/20220427-193233-ladsgroup.json [19:32:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:32:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:34:57] (CR) Ahmon Dancy: "recheck" [deployment-charts] - https://gerrit.wikimedia.org/r/787058 (https://phabricator.wikimedia.org/T299648) (owner: Ahmon Dancy) [19:37:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T298563)', diff saved to https://phabricator.wikimedia.org/P26759 and previous config saved to /var/cache/conftool/dbconfig/20220427-193719-ladsgroup.json [19:37:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:37:28] T298563: Fix mismatching field type of column text.old_flags on wmf wikis - https://phabricator.wikimedia.org/T298563 [19:37:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T306560)', diff saved to https://phabricator.wikimedia.org/P26760 and previous config saved to /var/cache/conftool/dbconfig/20220427-193749-ladsgroup.json [19:37:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:38:00] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [19:38:25] (CR) Brennen Bearnes: "> This means we can't use the wildcard anymore and need to separate" [puppet] - https://gerrit.wikimedia.org/r/787051 (https://phabricator.wikimedia.org/T301183) (owner: Dzahn) [19:41:27] (CR) Hashar: [C: +2] Merge tag 'v3.4.4' into wmf/stable-3.4 
[software/gerrit] (wmf/stable-3.4) - https://gerrit.wikimedia.org/r/787055 (https://phabricator.wikimedia.org/T292759) (owner: Hashar) [19:41:36] (PS1) Ahmon Dancy: Revert "Update path to values file with image names" [deployment-charts] - https://gerrit.wikimedia.org/r/786424 [19:44:35] (CR) jerkins-bot: [V: -1] Revert "Update path to values file with image names" [deployment-charts] - https://gerrit.wikimedia.org/r/786424 (owner: Ahmon Dancy) [19:44:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P26761 and previous config saved to /var/cache/conftool/dbconfig/20220427-194448-ladsgroup.json [19:44:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:45:16] (PS2) Hashar: Gerrit v3.4.4 and rebuild plugins [software/gerrit] (deploy/wmf/stable-3.4) - https://gerrit.wikimedia.org/r/787000 (https://phabricator.wikimedia.org/T292759) [19:46:05] (CR) jerkins-bot: [V: -1] Gerrit v3.4.4 and rebuild plugins [software/gerrit] (deploy/wmf/stable-3.4) - https://gerrit.wikimedia.org/r/787000 (https://phabricator.wikimedia.org/T292759) (owner: Hashar) [19:48:04] ops-eqiad, Cassandra, User-Eevans: Relocate hosts: aqs10[3-5] - https://phabricator.wikimedia.org/T307035 (Eevans) [19:48:27] ops-eqiad, Cassandra, User-Eevans: Relocate hosts: aqs10[3-5] - https://phabricator.wikimedia.org/T307035 (Eevans) p:Triage→Lowest [19:48:35] (Merged) jenkins-bot: Merge tag 'v3.4.4' into wmf/stable-3.4 [software/gerrit] (wmf/stable-3.4) - https://gerrit.wikimedia.org/r/787055 (https://phabricator.wikimedia.org/T292759) (owner: Hashar) [19:49:11] (PS2) DLynch: Halt the DiscussionTools A/B test [mediawiki-config] - https://gerrit.wikimedia.org/r/784582 (https://phabricator.wikimedia.org/T291873) [19:49:42] ops-eqiad, Cassandra, DC-Ops, User-Eevans: Relocate hosts: aqs10[3-5] - 
https://phabricator.wikimedia.org/T307035 (wiki_willy) a:Cmjohnson [19:50:54] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [19:52:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P26762 and previous config saved to /var/cache/conftool/dbconfig/20220427-195224-ladsgroup.json [19:52:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:52:34] ops-eqiad, Cassandra, DC-Ops, User-Eevans: Relocate hosts: aqs10[3-5] - https://phabricator.wikimedia.org/T307035 (Eevans) [19:52:47] ops-eqiad, Cassandra, DC-Ops, User-Eevans: Relocate hosts: aqs10[3-5] - https://phabricator.wikimedia.org/T307035 (Eevans) [19:52:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P26763 and previous config saved to /var/cache/conftool/dbconfig/20220427-195255-ladsgroup.json [19:52:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:54:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122 (T298565)', diff saved to https://phabricator.wikimedia.org/P26764 and previous config saved to /var/cache/conftool/dbconfig/20220427-195444-ladsgroup.json [19:54:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:54:50] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [19:59:04] (CR) Hashar: "recheck" 
[software/gerrit] (deploy/wmf/stable-3.4) - https://gerrit.wikimedia.org/r/787000 (https://phabricator.wikimedia.org/T292759) (owner: Hashar) [19:59:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T298558)', diff saved to https://phabricator.wikimedia.org/P26765 and previous config saved to /var/cache/conftool/dbconfig/20220427-195953-ladsgroup.json [19:59:55] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1165.eqiad.wmnet with reason: Maintenance [19:59:56] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1165.eqiad.wmnet with reason: Maintenance [19:59:58] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [19:59:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:01] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [20:00:01] T298558: Fix mismatching field type of protected_titles.pt_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298558 [20:00:04] RoanKattouw, Urbanecm, and cjming: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220427T2000). [20:00:04] kemayo: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:00:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1165 (T298558)', diff saved to https://phabricator.wikimedia.org/P26766 and previous config saved to /var/cache/conftool/dbconfig/20220427-200006-ladsgroup.json [20:00:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:01:20] 👋 [20:01:23] Hello [20:01:40] (CR) Hashar: "Updated from 3.4.3 to 3.4.4 :)" [software/gerrit] (deploy/wmf/stable-3.4) - https://gerrit.wikimedia.org/r/787000 (https://phabricator.wikimedia.org/T292759) (owner: Hashar) [20:02:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T298558)', diff saved to https://phabricator.wikimedia.org/P26767 and previous config saved to /var/cache/conftool/dbconfig/20220427-200213-ladsgroup.json [20:02:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:02:27] (CR) Catrope: [C: +2] Halt the DiscussionTools A/B test [mediawiki-config] - https://gerrit.wikimedia.org/r/784582 (https://phabricator.wikimedia.org/T291873) (owner: DLynch) [20:02:28] * urbanecm waves [20:03:31] (Merged) jenkins-bot: Halt the DiscussionTools A/B test [mediawiki-config] - https://gerrit.wikimedia.org/r/784582 (https://phabricator.wikimedia.org/T291873) (owner: DLynch) [20:04:32] Kemayo: Your patch is on mwdebug1002, please test [20:04:44] Testing now [20:05:22] RoanKattouw: Looks good [20:06:51] !log catrope@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:784582|Halt the DiscussionTools A/B test (T291873)]] (duration: 00m 51s) [20:06:57] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:06:58] T291873: Deploy config change to "turn off" New Discussion Tool A/B test - https://phabricator.wikimedia.org/T291873 [20:07:01] Kemayo: All done [20:07:09] Thanks! [20:07:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P26768 and previous config saved to /var/cache/conftool/dbconfig/20220427-200729-ladsgroup.json [20:07:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P26769 and previous config saved to /var/cache/conftool/dbconfig/20220427-200800-ladsgroup.json [20:08:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:09:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122', diff saved to https://phabricator.wikimedia.org/P26770 and previous config saved to /var/cache/conftool/dbconfig/20220427-200949-ladsgroup.json [20:09:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:16:00] (PS1) Jbond: prometheus::blackbox::check: add new blackbox exporter check [puppet] - https://gerrit.wikimedia.org/r/787067 [20:17:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P26771 and previous config saved to /var/cache/conftool/dbconfig/20220427-201718-ladsgroup.json [20:17:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:53] (PS2) Jbond: prometheus::blackbox::check: add new blackbox exporter check [puppet] - https://gerrit.wikimedia.org/r/787067 [20:21:36] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host gitlab-runner2002.mgmt.codfw.wmnet with reboot policy FORCED [20:21:40] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:22:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T298563)', diff saved to https://phabricator.wikimedia.org/P26772 and previous config saved to /var/cache/conftool/dbconfig/20220427-202234-ladsgroup.json [20:22:36] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1150.eqiad.wmnet with reason: Maintenance [20:22:38] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1150.eqiad.wmnet with reason: Maintenance [20:22:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:22:40] T298563: Fix mismatching field type of column text.old_flags on wmf wikis - https://phabricator.wikimedia.org/T298563 [20:22:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:22:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T306560)', diff saved to https://phabricator.wikimedia.org/P26773 and previous config saved to /var/cache/conftool/dbconfig/20220427-202306-ladsgroup.json [20:23:07] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host gitlab-runner2003.mgmt.codfw.wmnet with reboot policy FORCED [20:23:08] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [20:23:09] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [20:23:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:12] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [20:23:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3317 (T306560)', diff saved to 
https://phabricator.wikimedia.org/P26774 and previous config saved to /var/cache/conftool/dbconfig/20220427-202314-ladsgroup.json [20:23:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:24:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122', diff saved to https://phabricator.wikimedia.org/P26775 and previous config saved to /var/cache/conftool/dbconfig/20220427-202454-ladsgroup.json [20:24:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:25:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T306560)', diff saved to https://phabricator.wikimedia.org/P26776 and previous config saved to /var/cache/conftool/dbconfig/20220427-202527-ladsgroup.json [20:25:29] (PS3) Jbond: prometheus::blackbox::check: add new blackbox exporter check [puppet] - https://gerrit.wikimedia.org/r/787067 [20:25:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:04] PROBLEM - Host logstash2028.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [20:32:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P26777 and previous config saved to /var/cache/conftool/dbconfig/20220427-203223-ladsgroup.json [20:32:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:28] (CR) Ottomata: [C: +2] Release version 2.1.4-py3.7-4 [debs/airflow] (debian) - https://gerrit.wikimedia.org/r/787050 (owner: Ottomata) [20:33:30] (CR) Ottomata: [V: +2 C: +2] Release version 2.1.4-py3.7-4 [debs/airflow] (debian) - https://gerrit.wikimedia.org/r/787050 
(owner: Ottomata) [20:35:38] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2123.codfw.wmnet with reason: Maintenance [20:35:40] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2123.codfw.wmnet with reason: Maintenance [20:35:41] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on 8 hosts with reason: Maintenance [20:35:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:35:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:35:47] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 8 hosts with reason: Maintenance [20:35:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:35:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:35:59] !log ms-fe1009 - systemctl restart cron [20:36:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:36:40] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:39:22] ^ this is going to be some temp issue pulling data from comodo..expect self healing [20:39:26] happened before [20:39:42] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS6939/IPv6: Active - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [20:39:52] !log alert1001 - systemctl start certspotter [20:39:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:39:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122 (T298565)', diff saved to https://phabricator.wikimedia.org/P26778 and previous config saved to /var/cache/conftool/dbconfig/20220427-203959-ladsgroup.json [20:40:05] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:40:05] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [20:40:25] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1122.eqiad.wmnet with reason: Maintenance [20:40:26] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1122.eqiad.wmnet with reason: Maintenance [20:40:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:40:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1122 (T298565)', diff saved to https://phabricator.wikimedia.org/P26779 and previous config saved to /var/cache/conftool/dbconfig/20220427-204031-ladsgroup.json [20:40:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P26780 and previous config saved to /var/cache/conftool/dbconfig/20220427-204032-ladsgroup.json [20:40:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:40:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:40:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:10] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:41:13] (CR) Dzahn: [C: +2] "another pair of eyes saying "looks reasoable" is good enough for me here, that way Papaul can start installing later today. 
and yea, it's " [puppet] - https://gerrit.wikimedia.org/r/787051 (https://phabricator.wikimedia.org/T301183) (owner: Dzahn) [20:44:02] SRE, ops-eqiad, DC-Ops, serviceops, and 2 others: Q3:(Need By: TBD) rack/setup/install gitlab100[3|4] and gitlab-runner100[2|3|4] - https://phabricator.wikimedia.org/T301177 (Dzahn) You should be unblocked to install OS. partman recipe set to raid1-dev. [20:45:13] SRE, ops-codfw, DC-Ops, GitLab (Infrastructure), Patch-For-Review: Q3:(Need By: TBD) rack/setup/install gitlab200[2|3] and gitlab-runner200[2|3|4] - https://phabricator.wikimedia.org/T301183 (Dzahn) @Papaul You should be unblocked to install OS. partman recipe set to raid1-dev. [20:47:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T298558)', diff saved to https://phabricator.wikimedia.org/P26781 and previous config saved to /var/cache/conftool/dbconfig/20220427-204728-ladsgroup.json [20:47:30] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1131.eqiad.wmnet with reason: Maintenance [20:47:32] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1131.eqiad.wmnet with reason: Maintenance [20:47:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1131 (T298558)', diff saved to https://phabricator.wikimedia.org/P26782 and previous config saved to /var/cache/conftool/dbconfig/20220427-204736-ladsgroup.json [20:47:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:47:39] T298558: Fix mismatching field type of protected_titles.pt_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298558 [20:47:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:47:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:47:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:48:30] !log ladsgroup@cumin1001 START - 
Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [20:48:31] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [20:48:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:48:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:48:46] SRE, ops-codfw, DC-Ops, GitLab (Infrastructure), Patch-For-Review: Q3:(Need By: TBD) rack/setup/install gitlab200[2|3] and gitlab-runner200[2|3|4] - https://phabricator.wikimedia.org/T301183 (Papaul) @Dzahn thanks [20:48:46] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS6939/IPv4: Connect - HE, AS6939/IPv6: Connect - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [20:49:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131 (T298558)', diff saved to https://phabricator.wikimedia.org/P26783 and previous config saved to /var/cache/conftool/dbconfig/20220427-204943-ladsgroup.json [20:49:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:52:44] (PS1) Ebernhardson: cirrus: Enable DeprecationLoggedHttps [mediawiki-config] - https://gerrit.wikimedia.org/r/787068 (https://phabricator.wikimedia.org/T218994) [20:52:47] RECOVERY - Host logstash2028.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.06 ms [20:54:02] (PS1) Ebernhardson: cirrus: Turn on AB test of wbsearchentities profiles [mediawiki-config] - https://gerrit.wikimedia.org/r/787069 (https://phabricator.wikimedia.org/T306644) [20:54:27] sneaking a few patches into the end of backport window [20:55:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P26784 and previous config saved to /var/cache/conftool/dbconfig/20220427-205537-ladsgroup.json [20:55:39] (CR) 
10Ebernhardson: [C: 03+2] cirrus: Turn on AB test of wbsearchentities profiles [mediawiki-config] - 10https://gerrit.wikimedia.org/r/787069 (https://phabricator.wikimedia.org/T306644) (owner: 10Ebernhardson) [20:55:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:56:16] !log dzahn@cumin2002 conftool action : set/pooled=no; selector: dc=eqiad,name=ms-fe1009.eqiad.wmnet [20:56:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:56:25] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 115 probes of 669 (alerts on 90) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [20:56:35] (03Merged) 10jenkins-bot: cirrus: Turn on AB test of wbsearchentities profiles [mediawiki-config] - 10https://gerrit.wikimedia.org/r/787069 (https://phabricator.wikimedia.org/T306644) (owner: 10Ebernhardson) [20:58:49] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:59:06] (03PS1) 10Ladsgroup: SpecialExport: Add page table once [core] (wmf/1.39.0-wmf.9) - 10https://gerrit.wikimedia.org/r/787089 (https://phabricator.wikimedia.org/T307037) [20:59:36] !log dzahn@cumin2002 conftool action : set/pooled=yes; selector: dc=eqiad,name=ms-fe1009.eqiad.wmnet [20:59:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:35] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1130.eqiad.wmnet with reason: Maintenance [21:00:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1130.eqiad.wmnet with reason: Maintenance [21:00:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:41] !log ladsgroup@cumin1001 dbctl 
commit (dc=all): 'Depooling db1130 (T298563)', diff saved to https://phabricator.wikimedia.org/P26785 and previous config saved to /var/cache/conftool/dbconfig/20220427-210041-ladsgroup.json [21:00:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:50] T298563: Fix mismatching field type of column text.old_flags on wmf wikis - https://phabricator.wikimedia.org/T298563 [21:01:01] !log ebernhardson@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:787069|cirrus: Turn on AB test of wbsearchentities profiles (T306644)]] (duration: 00m 53s) [21:01:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:01:07] T306644: re-run wbsearchentities optimization process - https://phabricator.wikimedia.org/T306644 [21:01:19] (03CR) 10Ebernhardson: [C: 03+2] cirrus: Enable DeprecationLoggedHttps [mediawiki-config] - 10https://gerrit.wikimedia.org/r/787068 (https://phabricator.wikimedia.org/T218994) (owner: 10Ebernhardson) [21:01:59] (03Merged) 10jenkins-bot: cirrus: Enable DeprecationLoggedHttps [mediawiki-config] - 10https://gerrit.wikimedia.org/r/787068 (https://phabricator.wikimedia.org/T218994) (owner: 10Ebernhardson) [21:02:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122 (T298565)', diff saved to https://phabricator.wikimedia.org/P26786 and previous config saved to /var/cache/conftool/dbconfig/20220427-210237-ladsgroup.json [21:02:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:02:44] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [21:02:59] RECOVERY - IPv6 ping to eqiad on 
ripe-atlas-eqiad IPv6 is OK: OK - failed 65 probes of 669 (alerts on 90) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [21:04:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P26787 and previous config saved to /var/cache/conftool/dbconfig/20220427-210448-ladsgroup.json [21:04:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:05:25] (03PS1) 10Ebernhardson: Forward CirrusSearchDeprecation logs to logstash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/787074 (https://phabricator.wikimedia.org/T218994) [21:06:45] (03CR) 10Ebernhardson: [C: 03+2] Forward CirrusSearchDeprecation logs to logstash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/787074 (https://phabricator.wikimedia.org/T218994) (owner: 10Ebernhardson) [21:07:40] (03Merged) 10jenkins-bot: Forward CirrusSearchDeprecation logs to logstash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/787074 (https://phabricator.wikimedia.org/T218994) (owner: 10Ebernhardson) [21:07:52] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host gitlab-runner2003.mgmt.codfw.wmnet with reboot policy FORCED [21:07:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:08:09] (03CR) 10Brennen Bearnes: install_server/gitlab: separate partman recipes for physical servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/787051 (https://phabricator.wikimedia.org/T301183) (owner: 10Dzahn) [21:09:45] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host gitlab-runner2004.mgmt.codfw.wmnet with reboot policy FORCED [21:09:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:10:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance 
db1098:3317 (T306560)', diff saved to https://phabricator.wikimedia.org/P26788 and previous config saved to /var/cache/conftool/dbconfig/20220427-211042-ladsgroup.json [21:10:44] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1158.eqiad.wmnet with reason: Maintenance [21:10:46] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1158.eqiad.wmnet with reason: Maintenance [21:10:47] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [21:10:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:10:48] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [21:10:50] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [21:10:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:10:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1158 (T306560)', diff saved to https://phabricator.wikimedia.org/P26789 and previous config saved to /var/cache/conftool/dbconfig/20220427-211055-ladsgroup.json [21:10:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:11:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:11:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:11:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:12:13] !log ebernhardson@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:787074|Forward CirrusSearchDeprecation logs to logstash (T218994)]] (duration: 00m 56s) [21:12:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:12:18] T218994: Deprecation warning on 
elasticsearch 6 - https://phabricator.wikimedia.org/T218994 [21:13:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T306560)', diff saved to https://phabricator.wikimedia.org/P26790 and previous config saved to /var/cache/conftool/dbconfig/20220427-211305-ladsgroup.json [21:13:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:13:33] !log ebernhardson@deploy1002 Synchronized wmf-config/ProductionServices.php: Config: [[gerrit:787068|cirrus: Enable DeprecationLoggedHttps (T218994)]] (duration: 00m 54s) [21:13:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:13:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1130 (T298563)', diff saved to https://phabricator.wikimedia.org/P26791 and previous config saved to /var/cache/conftool/dbconfig/20220427-211352-ladsgroup.json [21:13:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:13:58] T298563: Fix mismatching field type of column text.old_flags on wmf wikis - https://phabricator.wikimedia.org/T298563 [21:16:17] !log ebernhardson@deploy1002 Synchronized wmf-config/ProductionServices.php: Config: Revert: [[gerrit:787068|cirrus: Enable DeprecationLoggedHttps (T218994)]] (duration: 00m 58s) [21:16:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:17:25] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 365 probes of 669 (alerts on 90) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [21:17:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122', diff saved to https://phabricator.wikimedia.org/P26792 and previous config saved to /var/cache/conftool/dbconfig/20220427-211742-ladsgroup.json [21:17:47] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:17:58] (03PS1) 10Ebernhardson: Revert "cirrus: Enable DeprecationLoggedHttps" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/787076 [21:19:34] (03CR) 10Ebernhardson: [C: 03+2] Revert "cirrus: Enable DeprecationLoggedHttps" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/787076 (owner: 10Ebernhardson) [21:19:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P26793 and previous config saved to /var/cache/conftool/dbconfig/20220427-211953-ladsgroup.json [21:19:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:20:15] (03Merged) 10jenkins-bot: Revert "cirrus: Enable DeprecationLoggedHttps" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/787076 (owner: 10Ebernhardson) [21:21:29] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:23:12] (03CR) 10Ahmon Dancy: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/786424 (owner: 10Ahmon Dancy) [21:23:41] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 70 probes of 669 (alerts on 90) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [21:26:50] (03CR) 10Dzahn: "data did not get synced to https://config-master.wikimedia.org/pybal/codfw/ ??" 
[puppet] - 10https://gerrit.wikimedia.org/r/785918 (https://phabricator.wikimedia.org/T290192) (owner: 10Dzahn) [21:27:43] 10SRE, 10serviceops: Q1:(Need By: TBD) rack/setup/install mw241[2-9].codfw.wmnet - https://phabricator.wikimedia.org/T290192 (10Dzahn) [21:27:51] 10SRE, 10serviceops: Q1:(Need By: TBD) rack/setup/install mw241[2-9].codfw.wmnet - https://phabricator.wikimedia.org/T290192 (10Dzahn) 05Resolved→03Open [21:28:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P26794 and previous config saved to /var/cache/conftool/dbconfig/20220427-212810-ladsgroup.json [21:28:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:28:17] 10SRE, 10serviceops: Q1:(Need By: TBD) rack/setup/install mw241[2-9].codfw.wmnet - https://phabricator.wikimedia.org/T290192 (10Dzahn) a:05Papaul→03Dzahn [21:28:52] 10SRE, 10serviceops: Q1:(Need By: TBD) rack/setup/install mw241[2-9].codfw.wmnet - https://phabricator.wikimedia.org/T290192 (10Dzahn) after https://gerrit.wikimedia.org/r/c/operations/puppet/+/785918 the conftool-data change does not appear on https://config-master.wikimedia.org/pybal/codfw/ ? [21:31:01] 10SRE, 10serviceops: Q1:(Need By: TBD) rack/setup/install mw241[2-9].codfw.wmnet - https://phabricator.wikimedia.org/T290192 (10Papaul) @Dzahn i think it is best to create another task for this issue and not reopen the rack/setup task. 
Thanks [21:32:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122', diff saved to https://phabricator.wikimedia.org/P26795 and previous config saved to /var/cache/conftool/dbconfig/20220427-213247-ladsgroup.json [21:32:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:34:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131 (T298558)', diff saved to https://phabricator.wikimedia.org/P26796 and previous config saved to /var/cache/conftool/dbconfig/20220427-213458-ladsgroup.json [21:35:00] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance [21:35:02] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance [21:35:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:35:05] T298558: Fix mismatching field type of protected_titles.pt_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298558 [21:35:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1096:3316 (T298558)', diff saved to https://phabricator.wikimedia.org/P26797 and previous config saved to /var/cache/conftool/dbconfig/20220427-213507-ladsgroup.json [21:35:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:35:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:35:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:37:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316 (T298558)', diff saved to https://phabricator.wikimedia.org/P26798 and previous config saved to /var/cache/conftool/dbconfig/20220427-213715-ladsgroup.json [21:37:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:40:30] 10SRE, 10ops-codfw, 10DC-Ops, 10GitLab (Infrastructure): 
Q3:(Need By: TBD) rack/setup/install gitlab200[2|3] and gitlab-runner200[2|3|4] - https://phabricator.wikimedia.org/T301183 (10Dzahn) confirming that the "gitlab" hosts should use a public IP and the "gitlab-runner" hosts should use a private IP. [21:42:16] (03PS1) 10Cwhite: beta-logs: temporarily undefine cluster jobs_host [puppet] - 10https://gerrit.wikimedia.org/r/787084 [21:42:30] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops, 10GitLab (Infrastructure): Q3:(Need By: TBD) rack/setup/install gitlab100[3|4] and gitlab-runner100[2|3|4] - https://phabricator.wikimedia.org/T301177 (10Dzahn) confirming that the "gitlab" hosts should use a public IP and the "gitlab-runner" hosts should use a... [21:43:10] (03CR) 10Cwhite: [C: 03+2] beta-logs: temporarily undefine cluster jobs_host [puppet] - 10https://gerrit.wikimedia.org/r/787084 (owner: 10Cwhite) [21:43:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P26799 and previous config saved to /var/cache/conftool/dbconfig/20220427-214315-ladsgroup.json [21:43:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:46:55] !log dzahn@cumin2002 conftool action : set/pooled=no; selector: dc=codfw,name=mw2412.codfw.wmnet [21:46:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:47:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122 (T298565)', diff saved to https://phabricator.wikimedia.org/P26800 and previous config saved to /var/cache/conftool/dbconfig/20220427-214752-ladsgroup.json [21:47:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:47:58] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - 
https://phabricator.wikimedia.org/T298565 [21:48:19] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1122.eqiad.wmnet with reason: Maintenance [21:48:20] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1122.eqiad.wmnet with reason: Maintenance [21:48:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:48:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1122 (T298565)', diff saved to https://phabricator.wikimedia.org/P26801 and previous config saved to /var/cache/conftool/dbconfig/20220427-214825-ladsgroup.json [21:48:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:48:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:49:16] 10SRE, 10ops-codfw, 10DC-Ops, 10GitLab (Infrastructure): Q3:(Need By: TBD) rack/setup/install gitlab200[2|3] and gitlab-runner200[2|3|4] - https://phabricator.wikimedia.org/T301183 (10Dzahn) 05Open→03In progress [21:50:11] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [21:50:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:50:59] (03CR) 10Dzahn: [C: 03+2] add a namespace for new service image-suggestion [deployment-charts] - 10https://gerrit.wikimedia.org/r/775964 (https://phabricator.wikimedia.org/T304891) (owner: 10Dzahn) [21:52:01] (03PS4) 10Razzi: dbproxy: add clouddb sections to conftool [puppet] - 10https://gerrit.wikimedia.org/r/779926 (https://phabricator.wikimedia.org/T304478) [21:52:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316', diff saved to https://phabricator.wikimedia.org/P26802 and previous config saved to /var/cache/conftool/dbconfig/20220427-215220-ladsgroup.json [21:52:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:54:27] (03CR) 10Dzahn: "attention-set reversal;)" [puppet] - 10https://gerrit.wikimedia.org/r/779531 
(https://phabricator.wikimedia.org/T305589) (owner: 10Dzahn) [21:54:41] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:54:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:55:17] (03Merged) 10jenkins-bot: add a namespace for new service image-suggestion [deployment-charts] - 10https://gerrit.wikimedia.org/r/775964 (https://phabricator.wikimedia.org/T304891) (owner: 10Dzahn) [21:56:28] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [21:56:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:58:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T306560)', diff saved to https://phabricator.wikimedia.org/P26803 and previous config saved to /var/cache/conftool/dbconfig/20220427-215820-ladsgroup.json [21:58:22] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1174.eqiad.wmnet with reason: Maintenance [21:58:23] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1174.eqiad.wmnet with reason: Maintenance [21:58:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:58:27] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [21:58:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1174 (T306560)', diff saved to https://phabricator.wikimedia.org/P26804 and previous config saved to /var/cache/conftool/dbconfig/20220427-215828-ladsgroup.json [21:58:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:58:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:58:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:59:08] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1096.eqiad.wmnet with reason: Maintenance [21:59:09] !log ladsgroup@cumin1001 END 
(PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1096.eqiad.wmnet with reason: Maintenance [21:59:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:59:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1096:3315 (T298563)', diff saved to https://phabricator.wikimedia.org/P26805 and previous config saved to /var/cache/conftool/dbconfig/20220427-215914-ladsgroup.json [21:59:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:59:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:59:22] T298563: Fix mismatching field type of column text.old_flags on wmf wikis - https://phabricator.wikimedia.org/T298563 [22:00:03] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:01:35] (03CR) 10Ahmon Dancy: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/787058 (https://phabricator.wikimedia.org/T299648) (owner: 10Ahmon Dancy) [22:01:37] (03PS6) 10Dzahn: docker: ensure apparmor package is installed if on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/785226 [22:01:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T306560)', diff saved to https://phabricator.wikimedia.org/P26806 and previous config saved to /var/cache/conftool/dbconfig/20220427-220139-ladsgroup.json [22:01:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:02:03] RECOVERY - mediawiki-installation DSH group on mw2412 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [22:02:47] (03CR) 10Dzahn: "rebased on https://gerrit.wikimedia.org/r/c/operations/puppet/+/785910 , added single quotes, adjusted comment text" [puppet] - 10https://gerrit.wikimedia.org/r/785226 (owner: 10Dzahn) [22:03:43] (03CR) 10Dzahn: "Hello Andrew, I noticed this because I had to rebase on top 
of it. Made me think you might also be interested in https://gerrit.wikimedia." [puppet] - 10https://gerrit.wikimedia.org/r/785910 (https://phabricator.wikimedia.org/T304433) (owner: 10Ottomata) [22:04:06] (03PS1) 10Papaul: Add new gitlab nodes in site.pp [puppet] - 10https://gerrit.wikimedia.org/r/787085 (https://phabricator.wikimedia.org/T301183) [22:04:33] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:04:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:04:48] (03CR) 10jerkins-bot: [V: 04-1] Add new gitlab nodes in site.pp [puppet] - 10https://gerrit.wikimedia.org/r/787085 (https://phabricator.wikimedia.org/T301183) (owner: 10Papaul) [22:05:44] (03CR) 10Dzahn: [C: 04-1] Add new gitlab nodes in site.pp (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/787085 (https://phabricator.wikimedia.org/T301183) (owner: 10Papaul) [22:06:10] (03PS1) 10Bking: Elastic: test puppet logic [puppet] - 10https://gerrit.wikimedia.org/r/787106 (https://phabricator.wikimedia.org/T299797) [22:07:03] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/787106 (https://phabricator.wikimedia.org/T299797) (owner: 10Bking) [22:07:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316', diff saved to https://phabricator.wikimedia.org/P26807 and previous config saved to /var/cache/conftool/dbconfig/20220427-220725-ladsgroup.json [22:07:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:09:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122 (T298565)', diff saved to https://phabricator.wikimedia.org/P26808 and previous config saved to /var/cache/conftool/dbconfig/20220427-220930-ladsgroup.json [22:09:34] !log running puppet on kubemasters - adding namespace to kubernetes for new service image-suggestion (T304891, T305155) [22:09:35] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:09:36] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [22:09:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:09:43] T304891: New Service Request Generated Datasets: Image Suggestions Service - https://phabricator.wikimedia.org/T304891 [22:09:43] T305155: Blubber setup for Image Suggestions Service - https://phabricator.wikimedia.org/T305155 [22:12:19] (03PS1) 10Dzahn: admin/dzahn: use the run-puppet-agent wrapper in my personal aliases [puppet] - 10https://gerrit.wikimedia.org/r/787107 [22:12:33] !log dzahn@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [22:12:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:15:32] (03PS2) 10Papaul: Add new gitlab nodes in site.pp [puppet] - 10https://gerrit.wikimedia.org/r/787085 (https://phabricator.wikimedia.org/T301183) [22:16:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315 (T298563)', diff saved to https://phabricator.wikimedia.org/P26809 and previous config saved to /var/cache/conftool/dbconfig/20220427-221600-ladsgroup.json [22:16:06] (03CR) 10jerkins-bot: [V: 04-1] Add new gitlab nodes in site.pp [puppet] - 10https://gerrit.wikimedia.org/r/787085 (https://phabricator.wikimedia.org/T301183) (owner: 10Papaul) [22:16:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:16:07] T298563: Fix mismatching field type of column text.old_flags on wmf wikis - https://phabricator.wikimedia.org/T298563 [22:16:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P26810 and previous config 
saved to /var/cache/conftool/dbconfig/20220427-221644-ladsgroup.json [22:16:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:18:36] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host gitlab-runner2004.mgmt.codfw.wmnet with reboot policy FORCED [22:18:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:19:01] !log dzahn@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [22:19:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:19:32] (03PS3) 10Papaul: Add new gitlab nodes in site.pp [puppet] - 10https://gerrit.wikimedia.org/r/787085 (https://phabricator.wikimedia.org/T301183) [22:20:27] (03CR) 10Papaul: [C: 03+2] Add new gitlab nodes in site.pp [puppet] - 10https://gerrit.wikimedia.org/r/787085 (https://phabricator.wikimedia.org/T301183) (owner: 10Papaul) [22:22:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316 (T298558)', diff saved to https://phabricator.wikimedia.org/P26811 and previous config saved to /var/cache/conftool/dbconfig/20220427-222230-ladsgroup.json [22:22:32] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [22:22:33] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [22:22:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:22:40] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2129.codfw.wmnet with reason: Maintenance [22:22:40] T298558: Fix mismatching field type of protected_titles.pt_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298558 [22:22:41] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2129.codfw.wmnet with reason: Maintenance [22:22:42] !log 
ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 8 hosts with reason: Maintenance [22:22:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:22:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:22:48] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 8 hosts with reason: Maintenance [22:22:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:22:55] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance [22:22:56] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance [22:22:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:23:00] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [22:23:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:23:01] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [22:23:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:23:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3316 (T298558)', diff saved to https://phabricator.wikimedia.org/P26812 and previous config saved to /var/cache/conftool/dbconfig/20220427-222306-ladsgroup.json [22:23:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:23:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:23:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:23:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:23:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[22:24:28] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host gitlab2002.wikimedia.org with OS bullseye [22:24:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:24:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122', diff saved to https://phabricator.wikimedia.org/P26813 and previous config saved to /var/cache/conftool/dbconfig/20220427-222435-ladsgroup.json [22:24:38] 10SRE, 10ops-codfw, 10DC-Ops, 10GitLab (Infrastructure), 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install gitlab200[2|3] and gitlab-runner200[2|3|4] - https://phabricator.wikimedia.org/T301183 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host git... [22:24:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:25:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 (T298558)', diff saved to https://phabricator.wikimedia.org/P26814 and previous config saved to /var/cache/conftool/dbconfig/20220427-222514-ladsgroup.json [22:25:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:25:50] 10SRE, 10ops-codfw, 10DC-Ops, 10GitLab (Infrastructure), 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install gitlab200[2|3] and gitlab-runner200[2|3|4] - https://phabricator.wikimedia.org/T301183 (10Papaul) [22:26:23] 10SRE, 10ops-codfw, 10DC-Ops, 10GitLab (Infrastructure), 10Patch-For-Review: Q3:(Need By: TBD) rack/setup/install gitlab200[2|3] and gitlab-runner200[2|3|4] - https://phabricator.wikimedia.org/T301183 (10Papaul) [22:31:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315', diff saved to https://phabricator.wikimedia.org/P26815 and previous config saved to /var/cache/conftool/dbconfig/20220427-223105-ladsgroup.json [22:31:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:31:49] !log 
ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P26816 and previous config saved to /var/cache/conftool/dbconfig/20220427-223149-ladsgroup.json [22:31:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:33:38] (03PS1) 10Ahmon Dancy: Allow deploy-mwdebug.py to be paused externally [puppet] - 10https://gerrit.wikimedia.org/r/787108 (https://phabricator.wikimedia.org/T299648) [22:34:36] !log kubernetes - Upgrading release=namespaces/namspace-certificates which added developer-portal and image-suggestion namespaces - but only on staging-codfw - (T304891, T305155, T297140) [22:34:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:34:44] T297140: New Service Request: developer-portal - https://phabricator.wikimedia.org/T297140 [22:34:44] T304891: New Service Request Generated Datasets: Image Suggestions Service - https://phabricator.wikimedia.org/T304891 [22:34:44] T305155: Blubber setup for Image Suggestions Service - https://phabricator.wikimedia.org/T305155 [22:35:29] (03CR) 10jerkins-bot: [V: 04-1] Allow deploy-mwdebug.py to be paused externally [puppet] - 10https://gerrit.wikimedia.org/r/787108 (https://phabricator.wikimedia.org/T299648) (owner: 10Ahmon Dancy) [22:37:09] (03PS2) 10Bking: Elastic: test puppet logic [puppet] - 10https://gerrit.wikimedia.org/r/787106 (https://phabricator.wikimedia.org/T299797) [22:37:25] (03PS2) 10Ahmon Dancy: Allow deploy-mwdebug.py to be paused externally [puppet] - 10https://gerrit.wikimedia.org/r/787108 (https://phabricator.wikimedia.org/T299648) [22:37:43] (03CR) 10jerkins-bot: [V: 04-1] Elastic: test puppet logic [puppet] - 10https://gerrit.wikimedia.org/r/787106 (https://phabricator.wikimedia.org/T299797) (owner: 10Bking) [22:37:55] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - 
https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [22:39:03] !log pt1979@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host gitlab2002.wikimedia.org with OS bullseye [22:39:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:39:10] 10SRE, 10ops-codfw, 10DC-Ops, 10GitLab (Infrastructure): Q3:(Need By: TBD) rack/setup/install gitlab200[2|3] and gitlab-runner200[2|3|4] - https://phabricator.wikimedia.org/T301183 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host gitlab2002.wikimedia.org with... [22:39:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122', diff saved to https://phabricator.wikimedia.org/P26817 and previous config saved to /var/cache/conftool/dbconfig/20220427-223940-ladsgroup.json [22:39:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:40:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P26818 and previous config saved to /var/cache/conftool/dbconfig/20220427-224019-ladsgroup.json [22:40:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:46:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315', diff saved to https://phabricator.wikimedia.org/P26819 and previous config saved to /var/cache/conftool/dbconfig/20220427-224610-ladsgroup.json [22:46:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:46:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T306560)', diff saved to https://phabricator.wikimedia.org/P26820 and previous config saved to /var/cache/conftool/dbconfig/20220427-224654-ladsgroup.json [22:46:56] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 
6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [22:46:57] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [22:47:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:47:00] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [22:47:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:47:05] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1101.eqiad.wmnet with reason: Maintenance [22:47:07] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1101.eqiad.wmnet with reason: Maintenance [22:47:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:47:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1101:3317 (T306560)', diff saved to https://phabricator.wikimedia.org/P26821 and previous config saved to /var/cache/conftool/dbconfig/20220427-224711-ladsgroup.json [22:47:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:47:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:47:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:49:29] (03PS1) 10Cwhite: opensearch: add configurable curator version parameter [puppet] - 10https://gerrit.wikimedia.org/r/787109 (https://phabricator.wikimedia.org/T301017) [22:50:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T306560)', diff saved to https://phabricator.wikimedia.org/P26822 and previous config saved to /var/cache/conftool/dbconfig/20220427-225026-ladsgroup.json [22:50:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:53:45] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host gitlab2002.wikimedia.org with OS 
bullseye [22:53:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:53:53] 10SRE, 10ops-codfw, 10DC-Ops, 10GitLab (Infrastructure): Q3:(Need By: TBD) rack/setup/install gitlab200[2|3] and gitlab-runner200[2|3|4] - https://phabricator.wikimedia.org/T301183 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host gitlab2002.wikimedia.org... [22:54:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122 (T298565)', diff saved to https://phabricator.wikimedia.org/P26823 and previous config saved to /var/cache/conftool/dbconfig/20220427-225445-ladsgroup.json [22:54:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:54:51] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [22:55:11] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1122.eqiad.wmnet with reason: Maintenance [22:55:13] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1122.eqiad.wmnet with reason: Maintenance [22:55:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:55:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1122 (T298565)', diff saved to https://phabricator.wikimedia.org/P26824 and previous config saved to /var/cache/conftool/dbconfig/20220427-225517-ladsgroup.json [22:55:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:55:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P26825 and previous config saved to /var/cache/conftool/dbconfig/20220427-225524-ladsgroup.json 
[22:55:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:55:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:56:59] PROBLEM - DNS on logstash2028.mgmt is CRITICAL: DNS CRITICAL - expected 0.0.0.0 but got 10.193.1.93 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:58:36] !log pt1979@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host gitlab2002.wikimedia.org with OS bullseye [22:58:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:58:43] 10SRE, 10ops-codfw, 10DC-Ops, 10GitLab (Infrastructure): Q3:(Need By: TBD) rack/setup/install gitlab200[2|3] and gitlab-runner200[2|3|4] - https://phabricator.wikimedia.org/T301183 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host gitlab2002.wikimedia.org with... [22:59:47] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host gitlab-runner2002.codfw.wmnet with OS bullseye [22:59:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:59:54] 10SRE, 10ops-codfw, 10DC-Ops, 10GitLab (Infrastructure): Q3:(Need By: TBD) rack/setup/install gitlab200[2|3] and gitlab-runner200[2|3|4] - https://phabricator.wikimedia.org/T301183 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host gitlab-runner2002.codfw.w... 
[23:01:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315 (T298563)', diff saved to https://phabricator.wikimedia.org/P26826 and previous config saved to /var/cache/conftool/dbconfig/20220427-230116-ladsgroup.json [23:01:19] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1161.eqiad.wmnet with reason: Maintenance [23:01:20] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1161.eqiad.wmnet with reason: Maintenance [23:01:22] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [23:01:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:01:24] T298563: Fix mismatching field type of column text.old_flags on wmf wikis - https://phabricator.wikimedia.org/T298563 [23:01:25] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [23:01:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:01:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1161 (T298563)', diff saved to https://phabricator.wikimedia.org/P26827 and previous config saved to /var/cache/conftool/dbconfig/20220427-230130-ladsgroup.json [23:01:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:01:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:01:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:01:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:01:55] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - 
https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [23:04:39] (03CR) 10Razzi: "This patch adds the skeleton of etcd services, so that in another patch I can use the config. Pretty much what you suggested on Phabricato" [puppet] - 10https://gerrit.wikimedia.org/r/779926 (https://phabricator.wikimedia.org/T304478) (owner: 10Razzi) [23:05:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P26828 and previous config saved to /var/cache/conftool/dbconfig/20220427-230531-ladsgroup.json [23:05:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:10:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 (T298558)', diff saved to https://phabricator.wikimedia.org/P26829 and previous config saved to /var/cache/conftool/dbconfig/20220427-231029-ladsgroup.json [23:10:31] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance [23:10:32] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance [23:10:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:10:36] T298558: Fix mismatching field type of protected_titles.pt_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298558 [23:10:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3316 (T298558)', diff saved to https://phabricator.wikimedia.org/P26830 and previous config saved to /var/cache/conftool/dbconfig/20220427-231037-ladsgroup.json [23:10:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:10:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:10:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[23:11:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 (T298558)', diff saved to https://phabricator.wikimedia.org/P26831 and previous config saved to /var/cache/conftool/dbconfig/20220427-231145-ladsgroup.json [23:11:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:14:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T298563)', diff saved to https://phabricator.wikimedia.org/P26832 and previous config saved to /var/cache/conftool/dbconfig/20220427-231437-ladsgroup.json [23:14:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:14:44] T298563: Fix mismatching field type of column text.old_flags on wmf wikis - https://phabricator.wikimedia.org/T298563 [23:17:19] RECOVERY - Router interfaces on cr1-drmrs is OK: OK: host 185.15.58.128, interfaces up: 58, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:17:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122 (T298565)', diff saved to https://phabricator.wikimedia.org/P26833 and previous config saved to /var/cache/conftool/dbconfig/20220427-231729-ladsgroup.json [23:17:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:17:36] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [23:17:43] 10SRE, 10ops-codfw, 10DC-Ops, 10GitLab (Infrastructure): Q3:(Need By: TBD) rack/setup/install gitlab200[2|3] and gitlab-runner200[2|3|4] - https://phabricator.wikimedia.org/T301183 (10Papaul) @Dzahn OS install failed on gitlab-runner2002 because of partitioning. 
I think it is because you have: partman/raid1-2... [23:18:09] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:18:23] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [23:20:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P26834 and previous config saved to /var/cache/conftool/dbconfig/20220427-232036-ladsgroup.json [23:20:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:22:27] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence-Backup: (Need By: TBD) rack/setup/install - https://phabricator.wikimedia.org/T307048 (10RobH) [23:22:40] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence-Backup: (Need By: TBD) rack/setup/install - https://phabricator.wikimedia.org/T307048 (10RobH) [23:22:45] (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [23:26:22] (03PS3) 10Razzi: Use both dbproxy101[89] servers for both wikireplica services [puppet] - 10https://gerrit.wikimedia.org/r/779915 (https://phabricator.wikimedia.org/T298940) (owner: 10Btullis) [23:26:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P26835 and previous config saved to /var/cache/conftool/dbconfig/20220427-232650-ladsgroup.json [23:26:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:27:26] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence-Backup: Q4:(Need By: TBD) rack/setup/install backup2009 - https://phabricator.wikimedia.org/T307049 (10RobH) [23:27:48] 10SRE, 10ops-codfw, 10DC-Ops, 
10Data-Persistence-Backup: Q4:(Need By: TBD) rack/setup/install backup2009 - https://phabricator.wikimedia.org/T307049 (10RobH) [23:29:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P26836 and previous config saved to /var/cache/conftool/dbconfig/20220427-232942-ladsgroup.json [23:29:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:31:15] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host gitlab-runner2002.codfw.wmnet with OS bullseye [23:31:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:31:22] 10SRE, 10ops-codfw, 10DC-Ops, 10GitLab (Infrastructure): Q3:(Need By: TBD) rack/setup/install gitlab200[2|3] and gitlab-runner200[2|3|4] - https://phabricator.wikimedia.org/T301183 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host gitlab-runner2002.codfw.wmnet... [23:32:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122', diff saved to https://phabricator.wikimedia.org/P26837 and previous config saved to /var/cache/conftool/dbconfig/20220427-233234-ladsgroup.json [23:32:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:33:39] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:34:33] (03CR) 10Razzi: [C: 03+1] "This LGTM; even if it blurs the analytics/web distinction it's worth having high availability for dbproxy hosts." 
[puppet] - 10https://gerrit.wikimedia.org/r/779915 (https://phabricator.wikimedia.org/T298940) (owner: 10Btullis) [23:35:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T306560)', diff saved to https://phabricator.wikimedia.org/P26838 and previous config saved to /var/cache/conftool/dbconfig/20220427-233541-ladsgroup.json [23:35:43] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance [23:35:44] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance [23:35:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:35:48] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [23:35:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:35:56] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2121.codfw.wmnet with reason: Maintenance [23:35:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:35:57] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2121.codfw.wmnet with reason: Maintenance [23:35:58] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 10 hosts with reason: Maintenance [23:36:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:36:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:36:06] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 10 hosts with reason: Maintenance [23:36:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:36:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:36:22] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on 
db1136.eqiad.wmnet with reason: Maintenance [23:36:24] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1136.eqiad.wmnet with reason: Maintenance [23:36:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:36:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1136 (T306560)', diff saved to https://phabricator.wikimedia.org/P26839 and previous config saved to /var/cache/conftool/dbconfig/20220427-233628-ladsgroup.json [23:36:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:36:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:37:43] (03PS1) 10Stang: itwiki: assign 'setmentor' to 'bot' usergroup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/787111 (https://phabricator.wikimedia.org/T307005) [23:38:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136 (T306560)', diff saved to https://phabricator.wikimedia.org/P26840 and previous config saved to /var/cache/conftool/dbconfig/20220427-233839-ladsgroup.json [23:38:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:41:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P26841 and previous config saved to /var/cache/conftool/dbconfig/20220427-234155-ladsgroup.json [23:42:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:44:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P26842 and previous config saved to /var/cache/conftool/dbconfig/20220427-234448-ladsgroup.json [23:44:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:46:23] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS6939/IPv4: Connect - HE, AS6939/IPv6: Connect - HE 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [23:47:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122', diff saved to https://phabricator.wikimedia.org/P26843 and previous config saved to /var/cache/conftool/dbconfig/20220427-234739-ladsgroup.json [23:47:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:48:04] (03PS3) 10Stang: Add kanoner.com to the wgCopyUploadsDomains allowlist of Wikimedia Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/786407 (https://phabricator.wikimedia.org/T306795) [23:50:54] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [23:53:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136', diff saved to https://phabricator.wikimedia.org/P26844 and previous config saved to /var/cache/conftool/dbconfig/20220427-235344-ladsgroup.json [23:53:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:57:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 (T298558)', diff saved to https://phabricator.wikimedia.org/P26845 and previous config saved to /var/cache/conftool/dbconfig/20220427-235700-ladsgroup.json [23:57:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:57:07] T298558: Fix mismatching field type of protected_titles.pt_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298558 [23:58:47] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:59:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T298563)', diff 
saved to https://phabricator.wikimedia.org/P26846 and previous config saved to /var/cache/conftool/dbconfig/20220427-235953-ladsgroup.json [23:59:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:59:59] T298563: Fix mismatching field type of column text.old_flags on wmf wikis - https://phabricator.wikimedia.org/T298563