[00:00:12] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudnet2006-dev.codfw.wmnet with reason: host reimage [00:00:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:25:24] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudservices2005-dev.wikimedia.org with reason: host reimage [00:25:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:37:01] SRE, ops-codfw, DC-Ops, cloud-services-team (Hardware): Q3:(Need By: TBD) rack/setup/install 7 wmcs hosts - https://phabricator.wikimedia.org/T304881 (Papaul) second interface added for clounet2006 ` [edit interfaces] + ge-1/0/26 { + description cloudnet2006-dev; + unit 0 { +... [00:46:51] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host cloudweb2002-dev.wikimedia.org with OS buster [00:49:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P25492 and previous config saved to /var/cache/conftool/dbconfig/20220420-004907-ladsgroup.json [00:49:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:53:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:01:44] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudweb2002-dev.wikimedia.org with reason: host reimage [01:01:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:02:44] (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [01:04:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P25495 and previous config saved to 
/var/cache/conftool/dbconfig/20220420-010412-ladsgroup.json [01:04:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:05:23] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudweb2002-dev.wikimedia.org with reason: host reimage [01:05:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:15:39] PROBLEM - Varnish traffic drop between 30min ago and now at esams on alert1001 is CRITICAL: 30.76 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [01:16:10] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudweb2002-dev.wikimedia.org with OS buster [01:16:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:16:18] SRE, ops-codfw, DC-Ops, cloud-services-team (Hardware): Q3:(Need By: TBD) rack/setup/install 7 wmcs hosts - https://phabricator.wikimedia.org/T304881 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cloudweb2002-dev.wikimedia.org with OS buster comple... [01:16:51] SRE, ops-codfw, DC-Ops, cloud-services-team (Hardware): Q3:(Need By: TBD) rack/setup/install 7 wmcs hosts - https://phabricator.wikimedia.org/T304881 (Papaul) [01:17:03] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is CRITICAL: 51.9 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [01:18:29] SRE, ops-codfw, DC-Ops, cloud-services-team (Hardware): Q3:(Need By: TBD) rack/setup/install 7 wmcs hosts - https://phabricator.wikimedia.org/T304881 (Papaul) Open→Resolved @Andrew and Cloud team this is ready for service . 
Thanks [01:18:37] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [01:18:45] RECOVERY - Varnish traffic drop between 30min ago and now at esams on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [01:18:49] SRE, ops-codfw: mc2031.mgmt looks down from icinga's perspective - https://phabricator.wikimedia.org/T306438 (Papaul) p:Triage→Medium [01:19:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T298565)', diff saved to https://phabricator.wikimedia.org/P25496 and previous config saved to /var/cache/conftool/dbconfig/20220420-011917-ladsgroup.json [01:19:19] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1157.eqiad.wmnet with reason: Maintenance [01:19:20] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1157.eqiad.wmnet with reason: Maintenance [01:19:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:19:22] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [01:19:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:19:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1157 (T298565)', diff saved to https://phabricator.wikimedia.org/P25497 and previous config saved to /var/cache/conftool/dbconfig/20220420-011925-ladsgroup.json [01:19:28] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:19:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:28:57] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [01:29:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:31:42] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [01:31:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:38:45] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:47:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [01:48:45] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:53:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T298565)', diff saved to https://phabricator.wikimedia.org/P25498 and previous config saved to /var/cache/conftool/dbconfig/20220420-015341-ladsgroup.json [01:53:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:53:47] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 
[01:54:57] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [01:55:11] PROBLEM - Router interfaces on cr2-drmrs is CRITICAL: CRITICAL: host 185.15.58.129, interfaces up: 56, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [01:55:29] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:56:07] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:57:29] PROBLEM - BFD status on cr2-eqdfw is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [02:08:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P25499 and previous config saved to /var/cache/conftool/dbconfig/20220420-020846-ladsgroup.json [02:08:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:12:15] RECOVERY - Router interfaces on cr2-drmrs is OK: OK: host 185.15.58.129, interfaces up: 60, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [02:12:25] RECOVERY - BFD status on cr2-eqdfw is OK: OK: UP: 13 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [02:12:33] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:13:11] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:14:05] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 24 AdminDown: 0 Down: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [02:19:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T298565)', diff saved to https://phabricator.wikimedia.org/P25500 and previous config saved to /var/cache/conftool/dbconfig/20220420-021939-ladsgroup.json [02:19:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:19:45] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [02:23:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P25501 and previous config saved to /var/cache/conftool/dbconfig/20220420-022352-ladsgroup.json [02:23:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:32:55] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [02:34:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P25502 and previous config saved to /var/cache/conftool/dbconfig/20220420-023444-ladsgroup.json [02:34:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:38:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T298565)', diff saved to https://phabricator.wikimedia.org/P25503 and previous config saved to /var/cache/conftool/dbconfig/20220420-023857-ladsgroup.json [02:39:01] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:39:02] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [02:39:45] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1110.eqiad.wmnet with reason: Maintenance [02:39:47] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1110.eqiad.wmnet with reason: Maintenance [02:39:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:39:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:39:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1110 (T298565)', diff saved to https://phabricator.wikimedia.org/P25504 and previous config saved to /var/cache/conftool/dbconfig/20220420-023951-ladsgroup.json [02:39:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:46:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T298565)', diff saved to https://phabricator.wikimedia.org/P25505 and previous config saved to /var/cache/conftool/dbconfig/20220420-024611-ladsgroup.json [02:46:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:46:18] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [02:49:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P25506 and previous config saved to 
/var/cache/conftool/dbconfig/20220420-024949-ladsgroup.json [02:49:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:01:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P25507 and previous config saved to /var/cache/conftool/dbconfig/20220420-030116-ladsgroup.json [03:01:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:01:55] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [03:04:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T298565)', diff saved to https://phabricator.wikimedia.org/P25508 and previous config saved to /var/cache/conftool/dbconfig/20220420-030454-ladsgroup.json [03:04:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:04:59] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [03:16:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P25509 and previous config saved to /var/cache/conftool/dbconfig/20220420-031621-ladsgroup.json [03:16:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:21:27] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:21:46] !log ladsgroup@cumin1001 START 
- Cookbook sre.hosts.downtime for 6:00:00 on db1112.eqiad.wmnet with reason: Maintenance [03:21:48] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1112.eqiad.wmnet with reason: Maintenance [03:21:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:21:49] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [03:21:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:21:53] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [03:21:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:21:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1112 (T298565)', diff saved to https://phabricator.wikimedia.org/P25510 and previous config saved to /var/cache/conftool/dbconfig/20220420-032157-ladsgroup.json [03:21:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:22:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:22:02] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [03:31:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T298565)', diff saved to https://phabricator.wikimedia.org/P25511 and previous config saved to /var/cache/conftool/dbconfig/20220420-033126-ladsgroup.json [03:31:28] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance [03:31:29] !log ladsgroup@cumin1001 END (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance [03:31:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:31:32] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [03:31:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:31:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:35:52] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2123.codfw.wmnet with reason: Maintenance [03:35:53] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2123.codfw.wmnet with reason: Maintenance [03:35:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:35:55] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 8 hosts with reason: Maintenance [03:35:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:36:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:36:01] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 8 hosts with reason: Maintenance [03:36:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:37:46] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [03:37:48] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [03:37:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:37:52] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:42:05] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1130.eqiad.wmnet with reason: Maintenance [03:42:06] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1130.eqiad.wmnet with reason: Maintenance [03:42:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:42:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:42:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1130 (T298565)', diff saved to https://phabricator.wikimedia.org/P25512 and previous config saved to /var/cache/conftool/dbconfig/20220420-034211-ladsgroup.json [03:42:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:42:15] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [03:44:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1130 (T298565)', diff saved to https://phabricator.wikimedia.org/P25513 and previous config saved to /var/cache/conftool/dbconfig/20220420-034443-ladsgroup.json [03:44:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:45:39] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:51:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T298565)', diff saved to https://phabricator.wikimedia.org/P25514 and previous config saved to /var/cache/conftool/dbconfig/20220420-035142-ladsgroup.json [03:51:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:51:48] 
T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [03:57:27] PROBLEM - Check unit status of netbox_ganeti_esams_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_esams_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [03:58:21] PROBLEM - SSH on wtp1045.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:58:41] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_esams_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:06:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P25515 and previous config saved to /var/cache/conftool/dbconfig/20220420-040647-ladsgroup.json [04:06:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:08:37] RECOVERY - Check unit status of netbox_ganeti_esams_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_esams_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [04:09:25] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:21:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P25516 and previous config saved to /var/cache/conftool/dbconfig/20220420-042152-ladsgroup.json [04:21:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:22:35] PROBLEM 
- Backup freshness on backup1001 is CRITICAL: Stale: 1 (cloudcontrol2003-dev), No backups: 2 (ldap-corp1001, ...), Fresh: 105 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [04:29:59] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance [04:30:01] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance [04:30:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:30:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3315 (T298565)', diff saved to https://phabricator.wikimedia.org/P25517 and previous config saved to /var/cache/conftool/dbconfig/20220420-043005-ladsgroup.json [04:30:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:30:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:30:11] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [04:34:56] (PS1) Marostegui: Revert "db1136: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/783919 [04:36:15] (CR) Marostegui: [C: +2] Revert "db1136: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/783919 (owner: Marostegui) [04:37:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T298565)', diff saved to https://phabricator.wikimedia.org/P25518 and previous config saved to /var/cache/conftool/dbconfig/20220420-043700-ladsgroup.json [04:37:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1136 (re)pooling @ 1%: After reimage', diff saved to https://phabricator.wikimedia.org/P25519 and previous config saved to 
/var/cache/conftool/dbconfig/20220420-043702-root.json [04:37:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:37:05] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1166.eqiad.wmnet with reason: Maintenance [04:37:05] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [04:37:06] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1166.eqiad.wmnet with reason: Maintenance [04:37:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:37:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:37:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1166 (T298565)', diff saved to https://phabricator.wikimedia.org/P25520 and previous config saved to /var/cache/conftool/dbconfig/20220420-043711-ladsgroup.json [04:37:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:37:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:40:13] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance [04:40:14] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance [04:40:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:40:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:41:26] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1169.eqiad.wmnet with reason: Maintenance [04:41:28] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 
on db1169.eqiad.wmnet with reason: Maintenance [04:41:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:41:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:41:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1169 (T306269)', diff saved to https://phabricator.wikimedia.org/P25521 and previous config saved to /var/cache/conftool/dbconfig/20220420-044132-marostegui.json [04:41:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:41:37] T306269: Make primary key ipblocks.ipb_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T306269 [04:44:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T306269)', diff saved to https://phabricator.wikimedia.org/P25522 and previous config saved to /var/cache/conftool/dbconfig/20220420-044443-marostegui.json [04:44:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:50:45] (PS1) Marostegui: db1132: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/784355 (https://phabricator.wikimedia.org/T301879) [04:51:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db1132 into s1 T301879', diff saved to https://phabricator.wikimedia.org/P25523 and previous config saved to /var/cache/conftool/dbconfig/20220420-045108-marostegui.json [04:51:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:51:14] T301879: Test MariaDB 10.6 on Bullseye - https://phabricator.wikimedia.org/T301879 [04:51:24] (CR) Marostegui: [C: +2] db1132: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/784355 (https://phabricator.wikimedia.org/T301879) (owner: Marostegui) [04:52:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P25524 and previous config saved to 
/var/cache/conftool/dbconfig/20220420-045205-ladsgroup.json [04:52:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:52:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1136 (re)pooling @ 5%: After reimage', diff saved to https://phabricator.wikimedia.org/P25525 and previous config saved to /var/cache/conftool/dbconfig/20220420-045212-root.json [04:52:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:54:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T298565)', diff saved to https://phabricator.wikimedia.org/P25526 and previous config saved to /var/cache/conftool/dbconfig/20220420-045416-ladsgroup.json [04:54:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:54:22] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [04:59:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P25527 and previous config saved to /var/cache/conftool/dbconfig/20220420-045948-marostegui.json [04:59:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:02:44] (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [05:07:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P25528 and previous config saved to /var/cache/conftool/dbconfig/20220420-050710-ladsgroup.json [05:07:13] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:07:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1136 (re)pooling @ 10%: After reimage', diff saved to https://phabricator.wikimedia.org/P25529 and previous config saved to /var/cache/conftool/dbconfig/20220420-050716-root.json [05:07:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:09:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P25530 and previous config saved to /var/cache/conftool/dbconfig/20220420-050921-ladsgroup.json [05:09:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:10:59] PROBLEM - SSH on bast3005 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [05:12:47] RECOVERY - SSH on bast3005 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [05:13:35] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 235, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:14:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P25531 and previous config saved to /var/cache/conftool/dbconfig/20220420-051453-marostegui.json [05:14:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:18:27] (PS1) Majavah: P:openldap_corp: fix backups [puppet] - https://gerrit.wikimedia.org/r/784581 [05:19:48] (CR) Majavah: [V: +1] "PCC SUCCESS (DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34907/console" [puppet] - https://gerrit.wikimedia.org/r/784581 (owner: Majavah) [05:21:27] (PS1) DLynch: Halt the DiscussionTools A/B test [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/784582 (https://phabricator.wikimedia.org/T291873) [05:22:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T298565)', diff saved to https://phabricator.wikimedia.org/P25532 and previous config saved to /var/cache/conftool/dbconfig/20220420-052215-ladsgroup.json [05:22:17] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance [05:22:18] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance [05:22:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:22:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1136 (re)pooling @ 25%: After reimage', diff saved to https://phabricator.wikimedia.org/P25533 and previous config saved to /var/cache/conftool/dbconfig/20220420-052220-root.json [05:22:22] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [05:22:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3315 (T298565)', diff saved to https://phabricator.wikimedia.org/P25534 and previous config saved to /var/cache/conftool/dbconfig/20220420-052223-ladsgroup.json [05:22:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:22:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:22:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:22:31] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 44, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down 
[05:22:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:24:05] (03PS1) 10Marostegui: monitor_eventscheduler.pp: Add new monitoring check [puppet] - 10https://gerrit.wikimedia.org/r/784583 (https://phabricator.wikimedia.org/T254738) [05:24:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P25535 and previous config saved to /var/cache/conftool/dbconfig/20220420-052427-ladsgroup.json [05:24:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:29:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T298565)', diff saved to https://phabricator.wikimedia.org/P25536 and previous config saved to /var/cache/conftool/dbconfig/20220420-052910-ladsgroup.json [05:29:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:29:15] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [05:29:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T306269)', diff saved to https://phabricator.wikimedia.org/P25537 and previous config saved to /var/cache/conftool/dbconfig/20220420-052958-marostegui.json [05:30:00] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1163.eqiad.wmnet with reason: Maintenance [05:30:02] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1163.eqiad.wmnet with reason: Maintenance [05:30:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:30:03] T306269: Make primary key ipblocks.ipb_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T306269 [05:30:05] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:30:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1163 (T306269)', diff saved to https://phabricator.wikimedia.org/P25538 and previous config saved to /var/cache/conftool/dbconfig/20220420-053006-marostegui.json [05:30:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:30:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:32:31] (03PS2) 10Marostegui: monitor_eventscheduler.pp: Add new monitoring check [puppet] - 10https://gerrit.wikimedia.org/r/784583 (https://phabricator.wikimedia.org/T254738) [05:33:04] (03CR) 10jerkins-bot: [V: 04-1] monitor_eventscheduler.pp: Add new monitoring check [puppet] - 10https://gerrit.wikimedia.org/r/784583 (https://phabricator.wikimedia.org/T254738) (owner: 10Marostegui) [05:33:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163 (T306269)', diff saved to https://phabricator.wikimedia.org/P25539 and previous config saved to /var/cache/conftool/dbconfig/20220420-053319-marostegui.json [05:33:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:34:19] (03PS3) 10Marostegui: monitor_eventscheduler.pp: Add new monitoring check [puppet] - 10https://gerrit.wikimedia.org/r/784583 (https://phabricator.wikimedia.org/T254738) [05:37:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1136 (re)pooling @ 50%: After reimage', diff saved to https://phabricator.wikimedia.org/P25540 and previous config saved to /var/cache/conftool/dbconfig/20220420-053724-root.json [05:37:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:38:59] !log start CF in monitoring mode for drmrs [05:39:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:39:02] !log ayounsi@cumin2002 START - Cookbook sre.network.cf [05:39:03] !log ayounsi@cumin2002 END (PASS) - Cookbook sre.network.cf 
(exit_code=0) [05:39:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:39:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:39:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T298565)', diff saved to https://phabricator.wikimedia.org/P25541 and previous config saved to /var/cache/conftool/dbconfig/20220420-053932-ladsgroup.json [05:39:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [05:39:35] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [05:39:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:39:36] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [05:39:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:39:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:43:29] RECOVERY - MariaDB Replica Lag: s8 on clouddb1021 is OK: OK slave_sql_lag Replication lag: 0.37 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [05:44:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P25542 and previous config saved to /var/cache/conftool/dbconfig/20220420-054415-ladsgroup.json [05:44:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:45:29] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 117 probes of 668 (alerts on 90) - 
https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [05:47:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [05:48:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163', diff saved to https://phabricator.wikimedia.org/P25543 and previous config saved to /var/cache/conftool/dbconfig/20220420-054824-marostegui.json [05:48:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:48:57] (03PS4) 10Marostegui: monitor_eventscheduler.pp: Add new monitoring check [puppet] - 10https://gerrit.wikimedia.org/r/784583 (https://phabricator.wikimedia.org/T254738) [05:49:30] (03CR) 10jerkins-bot: [V: 04-1] monitor_eventscheduler.pp: Add new monitoring check [puppet] - 10https://gerrit.wikimedia.org/r/784583 (https://phabricator.wikimedia.org/T254738) (owner: 10Marostegui) [05:50:31] (03PS5) 10Marostegui: monitor_eventscheduler.pp: Monitor event_scheduler on tests hosts [puppet] - 10https://gerrit.wikimedia.org/r/784583 (https://phabricator.wikimedia.org/T254738) [05:52:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1136 (re)pooling @ 75%: After reimage', diff saved to https://phabricator.wikimedia.org/P25544 and previous config saved to /var/cache/conftool/dbconfig/20220420-055228-root.json [05:52:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:59:13] 10SRE, 10ops-eqsin: cr3-eqsin:xe-0/1/1 interface errors - https://phabricator.wikimedia.org/T300485 (10ayounsi) Looks like we forgot about that during Jin' visits. What's the next step? 
[05:59:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P25545 and previous config saved to /var/cache/conftool/dbconfig/20220420-055920-ladsgroup.json [05:59:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:03:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163', diff saved to https://phabricator.wikimedia.org/P25546 and previous config saved to /var/cache/conftool/dbconfig/20220420-060329-marostegui.json [06:03:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:03:57] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 312 probes of 668 (alerts on 90) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [06:06:28] (03CR) 10Jcrespo: [C: 03+2] P:openldap_corp: fix backups [puppet] - 10https://gerrit.wikimedia.org/r/784581 (owner: 10Majavah) [06:07:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1136 (re)pooling @ 100%: After reimage', diff saved to https://phabricator.wikimedia.org/P25547 and previous config saved to /var/cache/conftool/dbconfig/20220420-060732-root.json [06:07:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:10:09] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 86 probes of 668 (alerts on 90) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [06:13:36] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, 10netops: Q2:(Need By: TBD) replace mr1-eqiad - https://phabricator.wikimedia.org/T294474 (10ayounsi) We synced up on IRC. The SCS ports was not configured, imho that's something DCops should do. 
Once done, looks like the device is stuck in a... [06:14:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T298565)', diff saved to https://phabricator.wikimedia.org/P25548 and previous config saved to /var/cache/conftool/dbconfig/20220420-061425-ladsgroup.json [06:14:27] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance [06:14:28] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance [06:14:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:14:30] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [06:14:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:14:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1096:3315 (T298565)', diff saved to https://phabricator.wikimedia.org/P25549 and previous config saved to /var/cache/conftool/dbconfig/20220420-061433-ladsgroup.json [06:14:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:14:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:18:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163 (T306269)', diff saved to https://phabricator.wikimedia.org/P25550 and previous config saved to /var/cache/conftool/dbconfig/20220420-061834-marostegui.json [06:18:36] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [06:18:38] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1003.eqiad.wmnet 
with reason: Maintenance [06:18:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:18:39] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance [06:18:39] T306269: Make primary key ipblocks.ipb_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T306269 [06:18:41] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance [06:18:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:18:42] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance [06:18:44] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance [06:18:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:18:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:18:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3311 (T306269)', diff saved to https://phabricator.wikimedia.org/P25551 and previous config saved to /var/cache/conftool/dbconfig/20220420-061848-marostegui.json [06:18:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:18:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:18:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:19:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:20:15] 10SRE, 10ops-esams, 10DC-Ops: ripe-atlas-esams down - https://phabricator.wikimedia.org/T303242 (10ayounsi) @RobH as ripe atlas can take time to provision, it would be nice to not wait too long. Similarly if the warranty support expires in 1 or 2 months. 
[06:21:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315 (T298565)', diff saved to https://phabricator.wikimedia.org/P25552 and previous config saved to /var/cache/conftool/dbconfig/20220420-062133-ladsgroup.json [06:21:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:21:38] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [06:22:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311 (T306269)', diff saved to https://phabricator.wikimedia.org/P25553 and previous config saved to /var/cache/conftool/dbconfig/20220420-062206-marostegui.json [06:22:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:24:25] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2105.codfw.wmnet with reason: Maintenance [06:24:27] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2105.codfw.wmnet with reason: Maintenance [06:24:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:24:28] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 6 hosts with reason: Maintenance [06:24:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:24:33] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 6 hosts with reason: Maintenance [06:24:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:24:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:25:45] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS2914/IPv6: Active - NTT 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [06:26:03] (03PS1) 10Urbanecm: plwiki: Fix cascading protection configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/784619 (https://phabricator.wikimedia.org/T306300) [06:31:34] 10SRE, 10ops-codfw: codfw: Dedicate Rack B1 for cloudX-dev servers - https://phabricator.wikimedia.org/T305469 (10jcrespo) backups are complaining of lack of recent backups of cloudcontrol2003-dev, as it is down. I will ignore those for a while- we must reenable monitoring once maintenance completes. [06:32:55] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [06:33:07] (03PS1) 10Kevin Bazira: ml-services: add svwiki, tawiki & translatewiki editquality isvcs [deployment-charts] - 10https://gerrit.wikimedia.org/r/784620 (https://phabricator.wikimedia.org/T301415) [06:33:27] (03CR) 10Marostegui: "https://puppet-compiler.wmflabs.org/pcc-worker1002/34912/" [puppet] - 10https://gerrit.wikimedia.org/r/784583 (https://phabricator.wikimedia.org/T254738) (owner: 10Marostegui) [06:33:59] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 241, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:36:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315', diff saved to https://phabricator.wikimedia.org/P25554 and previous config saved to /var/cache/conftool/dbconfig/20220420-063638-ladsgroup.json [06:36:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:37:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311', diff saved to 
https://phabricator.wikimedia.org/P25555 and previous config saved to /var/cache/conftool/dbconfig/20220420-063711-marostegui.json [06:37:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:40:01] (03PS1) 10Jcrespo: backup: Ignore cloudcontrol2003-dev backup monitoring [puppet] - 10https://gerrit.wikimedia.org/r/784621 (https://phabricator.wikimedia.org/T305469) [06:40:23] (03PS2) 10Jcrespo: backup: Ignore cloudcontrol2003-dev backup monitoring [puppet] - 10https://gerrit.wikimedia.org/r/784621 (https://phabricator.wikimedia.org/T305469) [06:42:21] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:42:22] (03CR) 10Jcrespo: [C: 03+2] backup: Ignore cloudcontrol2003-dev backup monitoring [puppet] - 10https://gerrit.wikimedia.org/r/784621 (https://phabricator.wikimedia.org/T305469) (owner: 10Jcrespo) [06:46:33] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 107 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [06:50:28] (03PS1) 10Jcrespo: Revert "backup: Ignore cloudcontrol2003-dev backup monitoring" [puppet] - 10https://gerrit.wikimedia.org/r/783922 [06:51:08] (03CR) 10Jcrespo: [C: 04-2] "-2 as we are waiting for maintenance to complete to reenable backup monitoring." 
[puppet] - 10https://gerrit.wikimedia.org/r/783922 (owner: 10Jcrespo) [06:51:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315', diff saved to https://phabricator.wikimedia.org/P25556 and previous config saved to /var/cache/conftool/dbconfig/20220420-065143-ladsgroup.json [06:51:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:52:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311', diff saved to https://phabricator.wikimedia.org/P25557 and previous config saved to /var/cache/conftool/dbconfig/20220420-065216-marostegui.json [06:52:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:57:30] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [06:57:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:59:09] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [06:59:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:00:04] Amir1, awight, Urbanecm, and taavi: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220420T0700). [07:00:04] kart_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:12] * kart_ is here [07:00:25] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . 
[07:00:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:00:47] (03PS3) 10KartikMistry: Enable SectionTranslation in Test WP for ckb, el, eu, and zh-yue [mediawiki-config] - 10https://gerrit.wikimedia.org/r/784223 (https://phabricator.wikimedia.org/T304854) [07:01:55] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [07:02:16] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [07:02:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:03:57] (03CR) 10KartikMistry: [C: 03+2] Enable SectionTranslation in Test WP for ckb, el, eu, and zh-yue [mediawiki-config] - 10https://gerrit.wikimedia.org/r/784223 (https://phabricator.wikimedia.org/T304854) (owner: 10KartikMistry) [07:04:40] (03Merged) 10jenkins-bot: Enable SectionTranslation in Test WP for ckb, el, eu, and zh-yue [mediawiki-config] - 10https://gerrit.wikimedia.org/r/784223 (https://phabricator.wikimedia.org/T304854) (owner: 10KartikMistry) [07:05:54] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
[07:05:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:06:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315 (T298565)', diff saved to https://phabricator.wikimedia.org/P25558 and previous config saved to /var/cache/conftool/dbconfig/20220420-070648-ladsgroup.json [07:06:51] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1161.eqiad.wmnet with reason: Maintenance [07:06:52] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1161.eqiad.wmnet with reason: Maintenance [07:06:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:06:53] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [07:06:54] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [07:06:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:06:57] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [07:06:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:07:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:07:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1161 (T298565)', diff saved to https://phabricator.wikimedia.org/P25559 and previous config saved to /var/cache/conftool/dbconfig/20220420-070702-ladsgroup.json [07:07:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:07:09] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:07:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311 (T306269)', diff saved to https://phabricator.wikimedia.org/P25560 and previous config saved to /var/cache/conftool/dbconfig/20220420-070721-marostegui.json [07:07:23] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1132.eqiad.wmnet with reason: Maintenance [07:07:25] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1132.eqiad.wmnet with reason: Maintenance [07:07:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:07:26] T306269: Make primary key ipblocks.ipb_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T306269 [07:07:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:07:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:07:58] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [07:08:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:08:01] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [07:08:02] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [07:08:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:08:06] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [07:08:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:08:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:08:52] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1133.eqiad.wmnet with reason: Maintenance [07:08:53] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1133.eqiad.wmnet with reason: Maintenance 
[07:08:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:08:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [07:08:55] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1106.eqiad.wmnet with reason: Maintenance [07:08:57] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1106.eqiad.wmnet with reason: Maintenance [07:08:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:08:58] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [07:09:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:09:01] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [07:09:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:09:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:09:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1106 (T306269)', diff saved to https://phabricator.wikimedia.org/P25561 and previous config saved to /var/cache/conftool/dbconfig/20220420-070906-marostegui.json [07:09:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:09:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:10:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T306269)', diff saved to https://phabricator.wikimedia.org/P25562 and previous config saved to 
/var/cache/conftool/dbconfig/20220420-071011-marostegui.json [07:10:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:10:30] !log kartik@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:784223|Enable SectionTranslation in Test WP for ckb, el, eu, and zh-yue (T304854 T304862 T304865 T304866)]] (duration: 01m 53s) [07:10:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:10:38] T304866: Enable Content and Section Translation for Central Kurdish Wikipedia - https://phabricator.wikimedia.org/T304866 [07:10:38] T304865: Enable Content and Section Translation for Cantonese Wikipedia - https://phabricator.wikimedia.org/T304865 [07:10:38] T304862: Enable Content and Section Translation for Basque Wikipedia - https://phabricator.wikimedia.org/T304862 [07:10:39] T304854: Enable Content and Section Translation for Greek Wikipedia - https://phabricator.wikimedia.org/T304854 [07:13:09] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [07:13:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:13:12] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [07:13:13] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [07:13:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:13:17] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [07:13:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:13:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:13:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw - 
https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [07:17:41] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1179.eqiad.wmnet with reason: Maintenance [07:17:43] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1179.eqiad.wmnet with reason: Maintenance [07:17:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:17:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:17:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1179 (T298565)', diff saved to https://phabricator.wikimedia.org/P25563 and previous config saved to /var/cache/conftool/dbconfig/20220420-071747-ladsgroup.json [07:17:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:17:52] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [07:25:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P25564 and previous config saved to /var/cache/conftool/dbconfig/20220420-072516-marostegui.json [07:25:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:27:59] 10SRE-tools, 10Infrastructure-Foundations: Cumin should group similar SSH errors - https://phabricator.wikimedia.org/T306490 (10Majavah) [07:35:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T298565)', diff saved to https://phabricator.wikimedia.org/P25565 and previous config saved to /var/cache/conftool/dbconfig/20220420-073501-ladsgroup.json [07:35:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:35:06] T298565: Fix mismatching 
field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [07:37:44] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, I'll go ahead and merge" [puppet] - 10https://gerrit.wikimedia.org/r/784320 (owner: 10Zabe) [07:37:49] (03CR) 10Muehlenhoff: [C: 03+2] admin: Update email address for Zabe [puppet] - 10https://gerrit.wikimedia.org/r/784320 (owner: 10Zabe) [07:40:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P25566 and previous config saved to /var/cache/conftool/dbconfig/20220420-074022-marostegui.json [07:40:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:46:14] (03CR) 10Elukey: [C: 03+2] ml-services: add svwiki, tawiki & translatewiki editquality isvcs [deployment-charts] - 10https://gerrit.wikimedia.org/r/784620 (https://phabricator.wikimedia.org/T301415) (owner: 10Kevin Bazira) [07:47:25] (03PS2) 10Zabe: wikitech: remove absented mw-xml cron [puppet] - 10https://gerrit.wikimedia.org/r/781054 (https://phabricator.wikimedia.org/T273673) [07:49:34] !log T305689: reset crosscluster settings of the elastic chi cluster in eqiad [07:49:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:49:39] T305689: Elasticsearch chi@eqiad cluster contains invalid cross cluster settings - https://phabricator.wikimedia.org/T305689 [07:50:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P25567 and previous config saved to /var/cache/conftool/dbconfig/20220420-075006-ladsgroup.json [07:50:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:51:49] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 
208.80.154.198, interfaces up: 45, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:52:07] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 236, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:55:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T306269)', diff saved to https://phabricator.wikimedia.org/P25568 and previous config saved to /var/cache/conftool/dbconfig/20220420-075527-marostegui.json [07:55:29] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1164.eqiad.wmnet with reason: Maintenance [07:55:30] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1164.eqiad.wmnet with reason: Maintenance [07:55:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:55:31] 10SRE-swift-storage, 10Data-Persistence, 10Maps, 10Product-Infrastructure-Team-Backlog, 10Patch-For-Review: Tegola pods are crashing because swift doesnt allow connections - https://phabricator.wikimedia.org/T306424 (10fgiunchedi) >>! In T306424#7865083, @Jgiannelos wrote: > Is it an option to bootstrap... 
[07:55:33] T306269: Make primary key ipblocks.ipb_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T306269 [07:55:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:55:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1164 (T306269)', diff saved to https://phabricator.wikimedia.org/P25569 and previous config saved to /var/cache/conftool/dbconfig/20220420-075535-marostegui.json [07:55:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:55:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:56:12] (03CR) 10Filippo Giunchedi: [C: 03+2] sre: add alerts for exporter-specific unavailability [alerts] - 10https://gerrit.wikimedia.org/r/778259 (https://phabricator.wikimedia.org/T288726) (owner: 10Filippo Giunchedi) [07:56:16] (03PS3) 10Filippo Giunchedi: sre: add alerts for exporter-specific unavailability [alerts] - 10https://gerrit.wikimedia.org/r/778259 (https://phabricator.wikimedia.org/T288726) [07:57:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1164 (T306269)', diff saved to https://phabricator.wikimedia.org/P25570 and previous config saved to /var/cache/conftool/dbconfig/20220420-075747-marostegui.json [07:57:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:01:22] !log ariel@cumin1001 START - Cookbook sre.hosts.reboot-single for host dumpsdata1004.eqiad.wmnet [08:01:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:01:46] !log reimage pybal-test2003 as buster - T297187 [08:01:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:01:50] T297187: Upgrade pybal-test200[23] from Stretch to Buster - https://phabricator.wikimedia.org/T297187 [08:01:53] RECOVERY - SSH on wtp1045.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:04:35] 
RECOVERY - Number of backend failures per minute from CirrusSearch on graphite1004 is OK: OK: Less than 20.00% above the threshold [300.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&viewPanel=9 [08:05:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P25571 and previous config saved to /var/cache/conftool/dbconfig/20220420-080511-ladsgroup.json [08:05:13] PROBLEM - Host pybal-test2003 is DOWN: PING CRITICAL - Packet loss = 100% [08:05:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:06:01] (03CR) 10Raymond Ndibe: "Hello David, can someone give me the permission to modify an existing patchset by another user in this repo? I can't seem to modify an exi" [puppet] - 10https://gerrit.wikimedia.org/r/780853 (owner: 10David Caro) [08:06:41] !log ariel@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dumpsdata1004.eqiad.wmnet [08:06:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:07:09] PROBLEM - Check systemd state on ms-be1064 is CRITICAL: CRITICAL - degraded: The following units failed: swift-drive-audit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:07:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T298565)', diff saved to https://phabricator.wikimedia.org/P25572 and previous config saved to /var/cache/conftool/dbconfig/20220420-080716-ladsgroup.json [08:07:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:07:20] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, 
user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [08:08:27] (03PS1) 10Filippo Giunchedi: thanos: reload rule via http endpoint [puppet] - 10https://gerrit.wikimedia.org/r/784629 (https://phabricator.wikimedia.org/T288726) [08:08:47] (03CR) 10jerkins-bot: [V: 04-1] thanos: reload rule via http endpoint [puppet] - 10https://gerrit.wikimedia.org/r/784629 (https://phabricator.wikimedia.org/T288726) (owner: 10Filippo Giunchedi) [08:09:37] (03PS2) 10Filippo Giunchedi: thanos: reload rule via http endpoint [puppet] - 10https://gerrit.wikimedia.org/r/784629 (https://phabricator.wikimedia.org/T288726) [08:09:58] (03CR) 10jerkins-bot: [V: 04-1] thanos: reload rule via http endpoint [puppet] - 10https://gerrit.wikimedia.org/r/784629 (https://phabricator.wikimedia.org/T288726) (owner: 10Filippo Giunchedi) [08:11:00] (03PS3) 10Filippo Giunchedi: thanos: reload rule via http endpoint [puppet] - 10https://gerrit.wikimedia.org/r/784629 (https://phabricator.wikimedia.org/T288726) [08:11:31] RECOVERY - Host pybal-test2003 is UP: PING OK - Packet loss = 0%, RTA = 31.69 ms [08:12:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1164', diff saved to https://phabricator.wikimedia.org/P25573 and previous config saved to /var/cache/conftool/dbconfig/20220420-081253-marostegui.json [08:12:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:13:09] I'm seeking a reviewer for a quick change -> https://gerrit.wikimedia.org/r/c/operations/puppet/+/784629 [08:15:02] 10SRE-swift-storage, 10Data-Persistence, 10Maps, 10Product-Infrastructure-Team-Backlog, 10Patch-For-Review: Tegola pods are crashing because swift doesnt allow connections - https://phabricator.wikimedia.org/T306424 (10MatthewVernon) If I've understood correctly, and this is a question of "contents of S3... 
[08:15:34] !log ariel@cumin1001 START - Cookbook sre.hosts.reboot-single for host dumpsdata1005.eqiad.wmnet [08:15:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:15:37] godog: hmm no protocol schema on that curl? [08:16:01] (03CR) 10Jelto: [C: 03+1] "lgtm" [deployment-charts] - 10https://gerrit.wikimedia.org/r/784227 (owner: 10Alexandros Kosiaris) [08:17:59] PROBLEM - Host pybal-test2003 is DOWN: PING CRITICAL - Packet loss = 100% [08:18:09] (03CR) 10Vgutierrez: [C: 03+1] thanos: reload rule via http endpoint [puppet] - 10https://gerrit.wikimedia.org/r/784629 (https://phabricator.wikimedia.org/T288726) (owner: 10Filippo Giunchedi) [08:18:31] 10SRE-tools, 10Infrastructure-Foundations: Cumin should group similar SSH errors - https://phabricator.wikimedia.org/T306490 (10jcrespo) Please note that cumin does group already identical content output. E.g.: ` # cumin 'P:cumin::target%cluster = backup' 'uname -v' 24 hosts will be targeted: backup[2001-2007... [08:18:33] RECOVERY - Host pybal-test2003 is UP: PING OK - Packet loss = 0%, RTA = 32.07 ms [08:18:40] godog: BTW.. I got one for you: https://gerrit.wikimedia.org/r/c/operations/puppet/+/784256 [08:19:19] vgutierrez: thanks! yeah no schema but http is the default anyways, will take a look at your review shortly [08:19:26] (03CR) 10Filippo Giunchedi: [C: 03+1] thanos: reload rule via http endpoint [puppet] - 10https://gerrit.wikimedia.org/r/784629 (https://phabricator.wikimedia.org/T288726) (owner: 10Filippo Giunchedi) [08:19:29] (03CR) 10Filippo Giunchedi: [C: 03+2] thanos: reload rule via http endpoint [puppet] - 10https://gerrit.wikimedia.org/r/784629 (https://phabricator.wikimedia.org/T288726) (owner: 10Filippo Giunchedi) [08:19:51] godog: yeah.. tiny nitpick.. but it's the default today... 
maybe not tomorrow ;P [08:20:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T298565)', diff saved to https://phabricator.wikimedia.org/P25574 and previous config saved to /var/cache/conftool/dbconfig/20220420-082016-ladsgroup.json [08:20:19] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance [08:20:20] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance [08:20:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:20:21] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [08:20:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:20:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:20:59] vgutierrez: haha! I hope to be retired in this "tomorrow" (wishful thinking) [08:21:08] :) [08:21:18] !log ariel@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dumpsdata1005.eqiad.wmnet [08:21:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:21:23] vgutierrez: re: your patch, is there a host/VM where I can see it in action?
[08:22:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P25575 and previous config saved to /var/cache/conftool/dbconfig/20220420-082221-ladsgroup.json [08:22:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:22:55] !log mmandere@cumin1001 START - Cookbook sre.puppet.renew-cert for pybal-test2003.codfw.wmnet: Renew puppet certificate - mmandere@cumin1001 [08:22:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:22:59] !log mmandere@cumin1001 END (FAIL) - Cookbook sre.puppet.renew-cert (exit_code=99) for pybal-test2003.codfw.wmnet: Renew puppet certificate - mmandere@cumin1001 [08:23:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:23:14] 10SRE-tools, 10Infrastructure-Foundations: Cumin should group similar SSH errors - https://phabricator.wikimedia.org/T306490 (10Volans) > Would it be possible to group the similar SSH errors where the only difference is the target hostname? Not currently, as the underlying library that does the grouping does... [08:27:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1164', diff saved to https://phabricator.wikimedia.org/P25576 and previous config saved to /var/cache/conftool/dbconfig/20220420-082758-marostegui.json [08:28:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:31:09] godog: the new one? 
I haven't deployed it manually [08:31:20] godog: the old version is up & running on every cp host of course [08:31:28] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netbox1001.wikimedia.org [08:31:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:31:50] godog: I'll test it manually on a ulsfo, one sec [08:31:59] 10SRE, 10Beta-Cluster-Infrastructure, 10Quality-and-Test-Engineering-Team (QTE), 10Traffic, and 2 others: [epic] The SSL certificate for Beta cluster domains fails to properly renew & deploy - https://phabricator.wikimedia.org/T293585 (10dom_walden) [08:33:17] vgutierrez: ok thanks! that or the test instances in cloud vps is what I had in mind FWIW [08:37:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P25577 and previous config saved to /var/cache/conftool/dbconfig/20220420-083726-ladsgroup.json [08:37:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:39:22] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host netbox1001.wikimedia.org [08:39:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:42:56] 10SRE-swift-storage, 10Data-Persistence, 10Maps, 10Product-Infrastructure-Team-Backlog, 10Patch-For-Review: Tegola pods are crashing because swift doesnt allow connections - https://phabricator.wikimedia.org/T306424 (10fgiunchedi) >>! In T306424#7866991, @MatthewVernon wrote: > If I've understood correct... 
[08:43:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1164 (T306269)', diff saved to https://phabricator.wikimedia.org/P25578 and previous config saved to /var/cache/conftool/dbconfig/20220420-084303-marostegui.json [08:43:06] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netboxdb1001.eqiad.wmnet [08:43:06] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1184.eqiad.wmnet with reason: Maintenance [08:43:08] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1184.eqiad.wmnet with reason: Maintenance [08:43:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:43:10] T306269: Make primary key ipblocks.ipb_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T306269 [08:43:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:43:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1184 (T306269)', diff saved to https://phabricator.wikimedia.org/P25579 and previous config saved to /var/cache/conftool/dbconfig/20220420-084312-marostegui.json [08:43:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:43:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:43:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:45:51] 10SRE, 10LDAP-Access-Requests, 10SRE Observability (FY2021/2022-Q4): Grant Access to ldap/wmf for "Mary Yang" - https://phabricator.wikimedia.org/T306225 (10Volans) Pending clarification from @dr0ptp4kt on the similar request T306437#7864599 [08:46:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T306269)', diff saved to https://phabricator.wikimedia.org/P25580 and previous config saved to /var/cache/conftool/dbconfig/20220420-084625-marostegui.json [08:46:29] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:12] (03PS1) 10Jcrespo: admin: Add placeholder to reserve uid and git 914 for minio-user [puppet] - 10https://gerrit.wikimedia.org/r/784633 (https://phabricator.wikimedia.org/T305446) [08:48:30] (03PS2) 10Jcrespo: admin: Add placeholder to reserve uid and git 914 for minio-user [puppet] - 10https://gerrit.wikimedia.org/r/784633 (https://phabricator.wikimedia.org/T305446) [08:49:07] (03PS3) 10Jcrespo: admin: Add placeholder to reserve uid and gid 914 for minio-user [puppet] - 10https://gerrit.wikimedia.org/r/784633 (https://phabricator.wikimedia.org/T305446) [08:49:18] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netboxdb1001.eqiad.wmnet [08:49:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:04] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netbox2001.wikimedia.org [08:50:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:30] (03PS4) 10Jcrespo: admin: Add placeholder to reserve uid and gid 914 for minio-user [puppet] - 10https://gerrit.wikimedia.org/r/784633 (https://phabricator.wikimedia.org/T305446) [08:52:14] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netbox2001.wikimedia.org [08:52:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T298565)', diff saved to https://phabricator.wikimedia.org/P25581 and previous config saved to /var/cache/conftool/dbconfig/20220420-085231-ladsgroup.json [08:52:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:36] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, 
user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [08:53:19] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance [08:53:21] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance [08:53:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:53:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:53:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1096:3315 (T298565)', diff saved to https://phabricator.wikimedia.org/P25582 and previous config saved to /var/cache/conftool/dbconfig/20220420-085325-ladsgroup.json [08:53:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:53:46] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Jim Maddock - https://phabricator.wikimedia.org/T249873 (10Volans) [08:57:43] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netboxdb2001.codfw.wmnet [08:57:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:59:43] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netboxdb2001.codfw.wmnet [08:59:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:03] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db1126.eqiad.wmnet with reason: Rebooting for T303174 [09:00:05] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db1126.eqiad.wmnet with reason: Rebooting for T303174 [09:00:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:10] !log kormat@cumin1001 dbctl commit (dc=all): 'db1126 depooling: Rebooting for T303174', diff saved 
to https://phabricator.wikimedia.org/P25583 and previous config saved to /var/cache/conftool/dbconfig/20220420-090010-kormat.json [09:00:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:28] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on es1029.eqiad.wmnet with reason: Rebooting for T303174 [09:00:29] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on es1029.eqiad.wmnet with reason: Rebooting for T303174 [09:00:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:01:59] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netbox-dev2001.wikimedia.org [09:02:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:02:10] RECOVERY - Check systemd state on ms-be1064 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:02:44] (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [09:03:39] (03PS1) 10Vgutierrez: cache: Add missing profile::monitoring config on cloud [puppet] - 10https://gerrit.wikimedia.org/r/784634 [09:04:46] (03CR) 10Vgutierrez: [C: 03+2] cache: Add missing profile::monitoring config on cloud [puppet] - 10https://gerrit.wikimedia.org/r/784634 (owner: 10Vgutierrez) [09:05:06] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [09:05:08] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [09:05:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:05:12] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:05:37] (03PS1) 10Filippo Giunchedi: thanos: aggregate exporter 'up' metrics [puppet] - 10https://gerrit.wikimedia.org/r/784635 (https://phabricator.wikimedia.org/T288726) [09:05:39] (03PS1) 10Filippo Giunchedi: prometheus: remove per-exporter up checks [puppet] - 10https://gerrit.wikimedia.org/r/784636 (https://phabricator.wikimedia.org/T288726) [09:09:47] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host netbox-dev2001.wikimedia.org [09:09:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:12:09] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host puppetmaster1004.eqiad.wmnet [09:12:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:12:34] (03PS1) 10Volans: admin: re-enable Jim Maddock's account [puppet] - 10https://gerrit.wikimedia.org/r/784637 (https://phabricator.wikimedia.org/T249873) [09:12:50] (03PS1) 10Vgutierrez: cache: Disable ATS monitoring on cloud instances [puppet] - 10https://gerrit.wikimedia.org/r/784638 [09:13:02] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for Jim Maddock - https://phabricator.wikimedia.org/T249873 (10Volans) [09:13:23] (03PS1) 10Elukey: Add the ores/deploy to deployment-prep's deploy server [puppet] - 10https://gerrit.wikimedia.org/r/784639 [09:13:54] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is CRITICAL: CRITICAL - Unknown error executing dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [09:14:15] (03CR) 10Elukey: [C: 03+2] Add the ores/deploy to deployment-prep's deploy server [puppet] - 10https://gerrit.wikimedia.org/r/784639 (owner: 10Elukey) [09:14:38] (03CR) 10Vgutierrez: [C: 03+2] cache: Disable ATS monitoring on cloud instances [puppet] - 10https://gerrit.wikimedia.org/r/784638 (owner: 10Vgutierrez) 
[09:15:08] elukey: may I merge your change as well? [09:15:18] 9296a07e0a [09:15:59] I was about to ask the same, go ahead :) [09:16:08] ack [09:16:14] !log kevinbazira@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [09:16:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:19] done [09:16:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetmaster1004.eqiad.wmnet [09:16:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:17:27] !log kevinbazira@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [09:17:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:17:39] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host bast6001.wikimedia.org [09:17:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:19:02] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1001 is CRITICAL: CRITICAL - Unknown error executing dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [09:19:41] ^ acking [09:19:52] !log kevinbazira@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
[09:19:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:20:07] ACKNOWLEDGEMENT - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1001 is CRITICAL: CRITICAL - Unknown error executing dbctl config diff Kormat Known issue https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [09:21:14] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast6001.wikimedia.org [09:21:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:23:35] !log kevinbazira@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [09:23:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:26:51] !log kevinbazira@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . [09:26:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:16] PROBLEM - k8s API server requests latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 verb=UPDATE https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [09:27:41] !log kevinbazira@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . [09:27:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:32] ACKNOWLEDGEMENT - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is CRITICAL: CRITICAL - Unknown error executing dbctl config diff Kormat Known issue. 
https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [09:29:29] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudweb2001-dev.wikimedia.org [09:29:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:44] RECOVERY - k8s API server requests latencies on ml-serve-ctrl1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [09:34:58] PROBLEM - Disk space on ml-staging-ctrl2002 is CRITICAL: DISK CRITICAL - free space: / 1133 MB (5% inode=95%): /tmp 1133 MB (5% inode=95%): /var/tmp 1133 MB (5% inode=95%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ml-staging-ctrl2002&var-datasource=codfw+prometheus/ops [09:37:33] ah snap, will check in a bit.. [09:37:42] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1001 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [09:38:16] !log kormat@cumin1001 dbctl commit (dc=all): 'db1126 (re)pooling @ 25%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25585 and previous config saved to /var/cache/conftool/dbconfig/20220420-093815-kormat.json [09:38:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:30] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [09:40:08] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:43:54] !log kormat@cumin1001 dbctl commit (dc=all): 'Switch es1 'primary' T303174', diff saved to https://phabricator.wikimedia.org/P25586 and previous config saved to 
/var/cache/conftool/dbconfig/20220420-094354-kormat.json [09:43:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:44:28] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on es1029.eqiad.wmnet with reason: Rebooting for T303174 [09:44:30] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on es1029.eqiad.wmnet with reason: Rebooting for T303174 [09:44:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:44:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:44:35] !log kormat@cumin1001 dbctl commit (dc=all): 'es1029 depooling: Rebooting for T303174', diff saved to https://phabricator.wikimedia.org/P25587 and previous config saved to /var/cache/conftool/dbconfig/20220420-094435-kormat.json [09:44:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:45:05] 10SRE: Upgrade ganeti-test to Bullseye - https://phabricator.wikimedia.org/T306499 (10MoritzMuehlenhoff) [09:45:11] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host cloudweb2001-dev.wikimedia.org [09:45:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:47:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [09:48:27] !log kormat@cumin1001 dbctl commit (dc=all): 'es1029 (re)pooling @ 25%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25588 and previous config saved to /var/cache/conftool/dbconfig/20220420-094827-kormat.json [09:48:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:37] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 
clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Rebooting db1156 T303174 [09:48:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:41] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Rebooting db1156 T303174 [09:48:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:51] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db1156.eqiad.wmnet with reason: Rebooting for T303174 [09:48:52] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db1156.eqiad.wmnet with reason: Rebooting for T303174 [09:48:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:58] !log kormat@cumin1001 dbctl commit (dc=all): 'db1156 depooling: Rebooting for T303174', diff saved to https://phabricator.wikimedia.org/P25589 and previous config saved to /var/cache/conftool/dbconfig/20220420-094857-kormat.json [09:49:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:52] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db1131.eqiad.wmnet with reason: Rebooting for T303174 [09:49:53] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db1131.eqiad.wmnet with reason: Rebooting for T303174 [09:49:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:57] (03PS1) 10Vgutierrez: cache: Disable monitoring for ats-tls on cloud [puppet] - 10https://gerrit.wikimedia.org/r/784644 [09:49:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:59] !log kormat@cumin1001 dbctl commit (dc=all): 'db1131 depooling: Rebooting for T303174', diff saved to 
https://phabricator.wikimedia.org/P25590 and previous config saved to /var/cache/conftool/dbconfig/20220420-094958-kormat.json [09:50:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:47] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host labweb1002.wikimedia.org [09:50:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:14] (03CR) 10Vgutierrez: [C: 03+2] cache: Disable monitoring for ats-tls on cloud [puppet] - 10https://gerrit.wikimedia.org/r/784644 (owner: 10Vgutierrez) [09:51:44] 10SRE-swift-storage, 10Data-Persistence, 10Maps, 10Product-Infrastructure-Team-Backlog, 10Patch-For-Review: Tegola pods are crashing because swift doesnt allow connections - https://phabricator.wikimedia.org/T306424 (10Jgiannelos) I think a good way forward could be to try to bootstrap the new (shared be... [09:52:03] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db1142.eqiad.wmnet with reason: Rebooting for T303174 [09:52:04] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db1142.eqiad.wmnet with reason: Rebooting for T303174 [09:52:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:10] !log kormat@cumin1001 dbctl commit (dc=all): 'db1142 depooling: Rebooting for T303174', diff saved to https://phabricator.wikimedia.org/P25591 and previous config saved to /var/cache/conftool/dbconfig/20220420-095209-kormat.json [09:52:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:29] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db1127.eqiad.wmnet with reason: Rebooting for T303174 [09:52:30] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db1127.eqiad.wmnet with reason: Rebooting for T303174 [09:52:31] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:36] !log kormat@cumin1001 dbctl commit (dc=all): 'db1127 depooling: Rebooting for T303174', diff saved to https://phabricator.wikimedia.org/P25592 and previous config saved to /var/cache/conftool/dbconfig/20220420-095235-kormat.json [09:52:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:19] !log kormat@cumin1001 dbctl commit (dc=all): 'db1126 (re)pooling @ 50%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25593 and previous config saved to /var/cache/conftool/dbconfig/20220420-095319-kormat.json [09:53:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:01] !log kormat@cumin1001 dbctl commit (dc=all): 'db1131 (re)pooling @ 25%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25594 and previous config saved to /var/cache/conftool/dbconfig/20220420-095401-kormat.json [09:54:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1184 (re)pooling @ 25%: After schema change', diff saved to https://phabricator.wikimedia.org/P25595 and previous config saved to /var/cache/conftool/dbconfig/20220420-095427-root.json [09:54:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:44] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db1156.eqiad.wmnet with reason: Rebooting for T303174 [09:54:46] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db1156.eqiad.wmnet with reason: Rebooting for T303174 [09:54:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:39] !log kormat@cumin1001 dbctl commit (dc=all): 'db1142 
(re)pooling @ 25%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25596 and previous config saved to /var/cache/conftool/dbconfig/20220420-095638-kormat.json [09:56:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:34] (03CR) 10Hnowlan: [C: 03+1] Revert "tegola: Point to codfw s3 endpoint for debugging." [deployment-charts] - 10https://gerrit.wikimedia.org/r/783912 (owner: 10Jgiannelos) [09:58:34] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db1127.eqiad.wmnet with reason: Rebooting for T303174 [09:58:36] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db1127.eqiad.wmnet with reason: Rebooting for T303174 [09:58:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:58:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:04] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on es1026.eqiad.wmnet with reason: Rebooting for T303174 [09:59:06] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on es1026.eqiad.wmnet with reason: Rebooting for T303174 [09:59:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:21] (03CR) 10Jgiannelos: [C: 03+2] Revert "tegola: Point to codfw s3 endpoint for debugging." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/783912 (owner: 10Jgiannelos) [10:02:25] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host labweb1002.wikimedia.org [10:02:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:46] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:02:51] !log kormat@cumin1001 dbctl commit (dc=all): 'es1026 (re)pooling @ 25%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25597 and previous config saved to /var/cache/conftool/dbconfig/20220420-100251-kormat.json [10:02:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:03:31] !log kormat@cumin1001 dbctl commit (dc=all): 'es1029 (re)pooling @ 50%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25598 and previous config saved to /var/cache/conftool/dbconfig/20220420-100331-kormat.json [10:03:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:03:40] (03PS2) 10ArielGlenn: include the dumps admins in the dumpsdata role [puppet] - 10https://gerrit.wikimedia.org/r/773195 [10:03:42] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host labweb1001.wikimedia.org [10:03:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:26] (03Merged) 10jenkins-bot: Revert "tegola: Point to codfw s3 endpoint for debugging." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/783912 (owner: 10Jgiannelos) [10:04:33] (03CR) 10ArielGlenn: [C: 03+2] include the dumps admins in the dumpsdata role [puppet] - 10https://gerrit.wikimedia.org/r/773195 (owner: 10ArielGlenn) [10:04:40] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db1156.eqiad.wmnet with reason: Rebooting for T303174 [10:04:42] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db1156.eqiad.wmnet with reason: Rebooting for T303174 [10:04:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:44] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db1127.eqiad.wmnet with reason: Rebooting for T303174 [10:04:46] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db1127.eqiad.wmnet with reason: Rebooting for T303174 [10:04:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:05:07] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on testvm[2001-2003].codfw.wmnet with reason: reboot [10:05:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:05:13] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on testvm[2001-2003].codfw.wmnet with reason: reboot [10:05:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:05:24] !log jgiannelos@deploy1002 helmfile [eqiad] START helmfile.d/services/tegola-vector-tiles: apply [10:05:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:05:29] !log jgiannelos@deploy1002 helmfile [eqiad] DONE helmfile.d/services/tegola-vector-tiles: apply [10:05:32] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:06:40] !log jgiannelos@deploy1002 helmfile [eqiad] START helmfile.d/services/tegola-vector-tiles: sync [10:06:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:06:59] !log jgiannelos@deploy1002 helmfile [eqiad] DONE helmfile.d/services/tegola-vector-tiles: sync [10:07:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:23] !log kormat@cumin1001 dbctl commit (dc=all): 'db1126 (re)pooling @ 75%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25599 and previous config saved to /var/cache/conftool/dbconfig/20220420-100823-kormat.json [10:08:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:09:05] !log kormat@cumin1001 dbctl commit (dc=all): 'db1131 (re)pooling @ 50%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25600 and previous config saved to /var/cache/conftool/dbconfig/20220420-100905-kormat.json [10:09:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:09:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1184 (re)pooling @ 50%: After schema change', diff saved to https://phabricator.wikimedia.org/P25601 and previous config saved to /var/cache/conftool/dbconfig/20220420-100931-root.json [10:09:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:11:42] !log kormat@cumin1001 dbctl commit (dc=all): 'db1142 (re)pooling @ 50%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25602 and previous config saved to /var/cache/conftool/dbconfig/20220420-101142-kormat.json [10:11:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:15:31] (03PS2) 10Volans: admin: re-enable Jim Maddock's account [puppet] - 10https://gerrit.wikimedia.org/r/784637 (https://phabricator.wikimedia.org/T249873) [10:15:43] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on 
es1031.eqiad.wmnet with reason: Rebooting for T303174 [10:15:44] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on es1031.eqiad.wmnet with reason: Rebooting for T303174 [10:15:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:15:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:15:50] !log kormat@cumin1001 dbctl commit (dc=all): 'es1031 depooling: Rebooting for T303174', diff saved to https://phabricator.wikimedia.org/P25603 and previous config saved to /var/cache/conftool/dbconfig/20220420-101549-kormat.json [10:15:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:16:17] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/784637 (https://phabricator.wikimedia.org/T249873) (owner: 10Volans) [10:16:18] 10SRE: Update Documentation and Process for Access to Search Consoles - https://phabricator.wikimedia.org/T303513 (10Aklapper) [10:16:19] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host labweb1001.wikimedia.org [10:16:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:16:41] 10SRE: Update Documentation and Process for Access to Search Consoles - https://phabricator.wikimedia.org/T303513 (10Aklapper) [10:17:06] PROBLEM - Disk space on ml-staging-ctrl2002 is CRITICAL: DISK CRITICAL - free space: / 1126 MB (5% inode=95%): /tmp 1126 MB (5% inode=95%): /var/tmp 1126 MB (5% inode=95%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ml-staging-ctrl2002&var-datasource=codfw+prometheus/ops [10:17:55] !log kormat@cumin1001 dbctl commit (dc=all): 'es1026 (re)pooling @ 50%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25604 and previous config saved to /var/cache/conftool/dbconfig/20220420-101755-kormat.json [10:17:58] Logged the message 
at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:35] !log kormat@cumin1001 dbctl commit (dc=all): 'es1029 (re)pooling @ 75%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25605 and previous config saved to /var/cache/conftool/dbconfig/20220420-101834-kormat.json [10:18:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:20:21] (03CR) 10Jcrespo: [C: 03+1] admin: re-enable Jim Maddock's account [puppet] - 10https://gerrit.wikimedia.org/r/784637 (https://phabricator.wikimedia.org/T249873) (owner: 10Volans) [10:20:42] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/784633 (https://phabricator.wikimedia.org/T305446) (owner: 10Jcrespo) [10:21:08] 10SRE: Update Documentation and Process for Access to Search Consoles - https://phabricator.wikimedia.org/T303513 (10Aklapper) @SCherukuwada: I highly recommend to make https://wikitech.wikimedia.org/wiki/Google_Search_Console_access a `#REDIRECT` to https://wikitech.wikimedia.org/wiki/Search_Console_Data as not... 
[10:21:11] (03CR) 10Volans: [C: 03+2] admin: re-enable Jim Maddock's account [puppet] - 10https://gerrit.wikimedia.org/r/784637 (https://phabricator.wikimedia.org/T249873) (owner: 10Volans) [10:22:05] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on es1031.eqiad.wmnet with reason: Rebooting for T303174 [10:22:07] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on es1031.eqiad.wmnet with reason: Rebooting for T303174 [10:22:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:27] !log kormat@cumin1001 dbctl commit (dc=all): 'db1126 (re)pooling @ 100%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25606 and previous config saved to /var/cache/conftool/dbconfig/20220420-102327-kormat.json [10:23:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:09] !log kormat@cumin1001 dbctl commit (dc=all): 'db1131 (re)pooling @ 75%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25607 and previous config saved to /var/cache/conftool/dbconfig/20220420-102409-kormat.json [10:24:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1184 (re)pooling @ 75%: After schema change', diff saved to https://phabricator.wikimedia.org/P25608 and previous config saved to /var/cache/conftool/dbconfig/20220420-102435-root.json [10:24:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:25:05] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db1156.eqiad.wmnet with reason: Rebooting for T303174 [10:25:05] (03PS1) 10Hnowlan: tegola: bump memory and CPU limit [deployment-charts] - 10https://gerrit.wikimedia.org/r/784651 (https://phabricator.wikimedia.org/T306424) [10:25:06] !log kormat@cumin1001 END (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db1156.eqiad.wmnet with reason: Rebooting for T303174 [10:25:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:25:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:25:34] 10SRE: Update Documentation and Process for Access to Search Consoles - https://phabricator.wikimedia.org/T303513 (10jcrespo) @Aklapper we recently discussed all of that on recent meeting. I will take care of the SRE clean up and help @SCherukuwada with the changes :-) (I am admin at wikitech and will be able t... [10:26:22] (03CR) 10Jgiannelos: [C: 03+1] tegola: bump memory and CPU limit [deployment-charts] - 10https://gerrit.wikimedia.org/r/784651 (https://phabricator.wikimedia.org/T306424) (owner: 10Hnowlan) [10:26:46] !log kormat@cumin1001 dbctl commit (dc=all): 'db1142 (re)pooling @ 75%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25609 and previous config saved to /var/cache/conftool/dbconfig/20220420-102646-kormat.json [10:26:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:26:57] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Rebooting db1167 T303174 [10:27:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:27:00] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Rebooting db1167 T303174 [10:27:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:27:07] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for Jim Maddock - https://phabricator.wikimedia.org/T249873 (10Volans) @jmads the access patch has been merged, it will be deployed across the fleet within the next 30 minutes. 
Feel free to close this task once... [10:27:16] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db1167.eqiad.wmnet with reason: Rebooting for T303174 [10:27:17] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db1167.eqiad.wmnet with reason: Rebooting for T303174 [10:27:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:27:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:27:22] !log kormat@cumin1001 dbctl commit (dc=all): 'db1167 depooling: Rebooting for T303174', diff saved to https://phabricator.wikimedia.org/P25610 and previous config saved to /var/cache/conftool/dbconfig/20220420-102722-kormat.json [10:27:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:05] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db1167.eqiad.wmnet with reason: Rebooting for T303174 [10:28:06] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db1167.eqiad.wmnet with reason: Rebooting for T303174 [10:28:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:30] 10SRE: Update Documentation and Process for Access to Search Consoles - https://phabricator.wikimedia.org/T303513 (10Aklapper) [10:29:54] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on es1031.eqiad.wmnet with reason: Rebooting for T303174 [10:29:55] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on es1031.eqiad.wmnet with reason: Rebooting for T303174 [10:29:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:30:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:30:08] (03CR) 10Hnowlan: [C: 03+2] tegola: bump memory and CPU limit [deployment-charts] - 
10https://gerrit.wikimedia.org/r/784651 (https://phabricator.wikimedia.org/T306424) (owner: 10Hnowlan) [10:31:14] !log kormat@cumin1001 dbctl commit (dc=all): 'db1156 (re)pooling @ 25%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25611 and previous config saved to /var/cache/conftool/dbconfig/20220420-103114-kormat.json [10:31:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:55] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [10:32:59] !log kormat@cumin1001 dbctl commit (dc=all): 'es1026 (re)pooling @ 75%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25612 and previous config saved to /var/cache/conftool/dbconfig/20220420-103258-kormat.json [10:33:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:39] !log kormat@cumin1001 dbctl commit (dc=all): 'es1029 (re)pooling @ 100%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25613 and previous config saved to /var/cache/conftool/dbconfig/20220420-103338-kormat.json [10:33:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:01] !log kormat@cumin1001 dbctl commit (dc=all): 'db1167 (re)pooling @ 25%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25614 and previous config saved to /var/cache/conftool/dbconfig/20220420-103400-kormat.json [10:34:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:10] (03Merged) 10jenkins-bot: tegola: bump memory and CPU limit [deployment-charts] - 10https://gerrit.wikimedia.org/r/784651 (https://phabricator.wikimedia.org/T306424) (owner: 10Hnowlan) [10:34:33] 10SRE-swift-storage, 10Data-Persistence, 10Maps, 
10Product-Infrastructure-Team-Backlog, 10Patch-For-Review: Tegola pods are crashing because swift doesnt allow connections - https://phabricator.wikimedia.org/T306424 (10hnowlan) >>! In T306424#7867246, @Jgiannelos wrote: > I think a good way forward could... [10:34:33] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on es1032.eqiad.wmnet with reason: Rebooting for T303174 [10:34:35] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on es1032.eqiad.wmnet with reason: Rebooting for T303174 [10:34:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:40] !log kormat@cumin1001 dbctl commit (dc=all): 'es1032 depooling: Rebooting for T303174', diff saved to https://phabricator.wikimedia.org/P25615 and previous config saved to /var/cache/conftool/dbconfig/20220420-103440-kormat.json [10:34:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:48] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/tegola-vector-tiles: sync [10:34:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:35:27] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on es1031.eqiad.wmnet with reason: Rebooting for T303174 [10:35:29] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on es1031.eqiad.wmnet with reason: Rebooting for T303174 [10:35:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:35:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:13] !log kormat@cumin1001 dbctl commit (dc=all): 'db1131 (re)pooling @ 100%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25616 and previous config saved to /var/cache/conftool/dbconfig/20220420-103913-kormat.json [10:39:17] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1184 (re)pooling @ 100%: After schema change', diff saved to https://phabricator.wikimedia.org/P25617 and previous config saved to /var/cache/conftool/dbconfig/20220420-103939-root.json [10:39:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:43] 10SRE: Allow Wikimedia Maps usage on a private project for an university. - https://phabricator.wikimedia.org/T306467 (10Aklapper) 05Open→03Stalled Hi, please fill in the form: **Link to site**: ... **Purpose/details about your project**: ... **Wikimedia Affiliate supporting project**: ... [10:41:10] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:41:23] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on es1032.eqiad.wmnet with reason: Rebooting for T303174 [10:41:25] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on es1032.eqiad.wmnet with reason: Rebooting for T303174 [10:41:26] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on es1031.eqiad.wmnet with reason: Rebooting for T303174 [10:41:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:27] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on es1031.eqiad.wmnet with reason: Rebooting for T303174 [10:41:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:50] !log kormat@cumin1001 dbctl commit (dc=all): 'db1142 (re)pooling @ 100%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25618 and previous config saved to 
/var/cache/conftool/dbconfig/20220420-104150-kormat.json [10:41:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:08] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1099.eqiad.wmnet with reason: Maintenance [10:42:09] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1099.eqiad.wmnet with reason: Maintenance [10:42:10] 10SRE, 10Search-Console-access-request: Update Documentation and Process for Access to Search Consoles - https://phabricator.wikimedia.org/T303513 (10jcrespo) Feel free to change "You will not be granted access to all of the (hundreds of) Wikimedia-managed domains but only a subset thereof that you have a busi... [10:42:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1099:3311 (T306269)', diff saved to https://phabricator.wikimedia.org/P25619 and previous config saved to /var/cache/conftool/dbconfig/20220420-104214-marostegui.json [10:42:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:19] T306269: Make primary key ipblocks.ipb_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T306269 [10:42:47] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Rebooting db1165 T303174 [10:42:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:51] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Rebooting db1165 T303174 [10:42:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:03] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on 
db1165.eqiad.wmnet with reason: Rebooting for T303174 [10:43:05] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db1165.eqiad.wmnet with reason: Rebooting for T303174 [10:43:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:10] !log kormat@cumin1001 dbctl commit (dc=all): 'db1165 depooling: Rebooting for T303174', diff saved to https://phabricator.wikimedia.org/P25620 and previous config saved to /var/cache/conftool/dbconfig/20220420-104310-kormat.json [10:43:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:14] PROBLEM - LVS kartotherian eqiad port 6533/tcp - Kartotherian- kartotherian.svc.eqiad.wmnet IPv4 on kartotherian.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [10:44:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311 (T306269)', diff saved to https://phabricator.wikimedia.org/P25621 and previous config saved to /var/cache/conftool/dbconfig/20220420-104437-marostegui.json [10:44:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:51] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/tegola-vector-tiles: sync [10:44:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:12] PROBLEM - Maps HTTPS on maps1010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [10:45:22] PROBLEM - Maps HTTPS on maps1007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [10:45:24] PROBLEM - LVS tegola-vector-tiles eqiad port 4105/tcp - Tegola Vector Tiles- tegola-vector-tiles.svc.eqiad.wmnet IPv4 on tegola-vector-tiles.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket 
timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [10:46:18] !log kormat@cumin1001 dbctl commit (dc=all): 'db1156 (re)pooling @ 50%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25622 and previous config saved to /var/cache/conftool/dbconfig/20220420-104618-kormat.json [10:46:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:38] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/tegola-vector-tiles: sync [10:46:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:47:14] PROBLEM - Maps HTTPS on maps1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [10:47:18] PROBLEM - Maps HTTPS on maps1008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [10:47:28] RECOVERY - LVS tegola-vector-tiles eqiad port 4105/tcp - Tegola Vector Tiles- tegola-vector-tiles.svc.eqiad.wmnet IPv4 on tegola-vector-tiles.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 2366 bytes in 7.515 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [10:48:03] !log kormat@cumin1001 dbctl commit (dc=all): 'es1026 (re)pooling @ 100%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25623 and previous config saved to /var/cache/conftool/dbconfig/20220420-104802-kormat.json [10:48:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:04] !log kormat@cumin1001 dbctl commit (dc=all): 'db1167 (re)pooling @ 50%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25624 and previous config saved to /var/cache/conftool/dbconfig/20220420-104904-kormat.json [10:49:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:18] (03CR) 10Hnowlan: [C: 03+2] Create docker configuration for local development [software/thumbor-plugins] - 
10https://gerrit.wikimedia.org/r/780659 (https://phabricator.wikimedia.org/T305249) (owner: 10Roman Stolar) [10:49:21] !log kormat@cumin1001 dbctl commit (dc=all): 'db1165 (re)pooling @ 25%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25625 and previous config saved to /var/cache/conftool/dbconfig/20220420-104920-kormat.json [10:49:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:52] !log kormat@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 25%: repooling T303174', diff saved to https://phabricator.wikimedia.org/P25626 and previous config saved to /var/cache/conftool/dbconfig/20220420-104951-kormat.json [10:49:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:58] PROBLEM - LVS kartotherian-ssl eqiad port 443/tcp - Kartotherian- kartotherian.svc.eqiad.wmnet - HTTPS IPv4 on kartotherian.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [10:50:12] (03Merged) 10jenkins-bot: Create docker configuration for local development [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/780659 (https://phabricator.wikimedia.org/T305249) (owner: 10Roman Stolar) [10:50:36] !log kormat@cumin1001 dbctl commit (dc=all): 'es1032 (re)pooling @ 25%: repooling T303174', diff saved to https://phabricator.wikimedia.org/P25627 and previous config saved to /var/cache/conftool/dbconfig/20220420-105035-kormat.json [10:50:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:13] !log kormat@cumin1001 dbctl commit (dc=all): 'es1031 (re)pooling @ 25%: repooling T303174', diff saved to https://phabricator.wikimedia.org/P25628 and previous config saved to /var/cache/conftool/dbconfig/20220420-105112-kormat.json [10:51:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:58] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 
on db1143.eqiad.wmnet with reason: Rebooting for T303174 [10:52:00] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db1143.eqiad.wmnet with reason: Rebooting for T303174 [10:52:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:05] !log kormat@cumin1001 dbctl commit (dc=all): 'db1143 depooling: Rebooting for T303174', diff saved to https://phabricator.wikimedia.org/P25629 and previous config saved to /var/cache/conftool/dbconfig/20220420-105204-kormat.json [10:52:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:21] 10SRE, 10SRE-Access-Requests, 10Infrastructure-Foundations, 10serviceops: allow certain users to disable puppet on mwdebug hosts - https://phabricator.wikimedia.org/T305979 (10MoritzMuehlenhoff) IO that seems like a reasonable request, but we should discuss this in the next IF SRE meeting My only worry is... 
[10:52:24] PROBLEM - Maps HTTPS on maps1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [10:55:48] PROBLEM - SSH on wtp1037.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:56:47] !log kormat@cumin1001 dbctl commit (dc=all): 'db1143 (re)pooling @ 25%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25630 and previous config saved to /var/cache/conftool/dbconfig/20220420-105646-kormat.json [10:56:47] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/tegola-vector-tiles: sync [10:56:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:57:54] RECOVERY - LVS kartotherian eqiad port 6533/tcp - Kartotherian- kartotherian.svc.eqiad.wmnet IPv4 on kartotherian.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1287 bytes in 0.529 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [10:58:22] RECOVERY - LVS kartotherian-ssl eqiad port 443/tcp - Kartotherian- kartotherian.svc.eqiad.wmnet - HTTPS IPv4 on kartotherian.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1287 bytes in 4.144 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [10:59:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', diff saved to https://phabricator.wikimedia.org/P25631 and previous config saved to /var/cache/conftool/dbconfig/20220420-105942-marostegui.json [10:59:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:48] RECOVERY - Maps HTTPS on maps1006 is OK: HTTP OK: HTTP/1.1 200 OK - 1287 bytes in 0.245 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [10:59:50] RECOVERY - Maps HTTPS on maps1008 is OK: HTTP OK: HTTP/1.1 200 OK - 1287 
bytes in 0.138 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [11:01:22] !log kormat@cumin1001 dbctl commit (dc=all): 'db1156 (re)pooling @ 75%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25632 and previous config saved to /var/cache/conftool/dbconfig/20220420-110122-kormat.json [11:01:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:55] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [11:02:02] RECOVERY - Maps HTTPS on maps1010 is OK: HTTP OK: HTTP/1.1 200 OK - 1287 bytes in 4.266 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [11:02:16] RECOVERY - Maps HTTPS on maps1007 is OK: HTTP OK: HTTP/1.1 200 OK - 1287 bytes in 4.784 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [11:02:54] RECOVERY - Maps HTTPS on maps1005 is OK: HTTP OK: HTTP/1.1 200 OK - 1287 bytes in 3.660 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [11:04:08] !log kormat@cumin1001 dbctl commit (dc=all): 'db1167 (re)pooling @ 75%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25633 and previous config saved to /var/cache/conftool/dbconfig/20220420-110408-kormat.json [11:04:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:15] (03PS1) 10Jgiannelos: maps: Disable replication and make postgres config on codfw/eqiad identical [puppet] - 10https://gerrit.wikimedia.org/r/784656 (https://phabricator.wikimedia.org/T306424) [11:04:24] !log kormat@cumin1001 dbctl commit (dc=all): 'db1165 (re)pooling @ 50%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25634 and previous config saved to /var/cache/conftool/dbconfig/20220420-110424-kormat.json [11:04:27] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:55] !log kormat@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 50%: repooling T303174', diff saved to https://phabricator.wikimedia.org/P25635 and previous config saved to /var/cache/conftool/dbconfig/20220420-110455-kormat.json [11:04:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:39] !log kormat@cumin1001 dbctl commit (dc=all): 'es1032 (re)pooling @ 50%: repooling T303174', diff saved to https://phabricator.wikimedia.org/P25636 and previous config saved to /var/cache/conftool/dbconfig/20220420-110539-kormat.json [11:05:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:51] !log kormat@cumin1001 dbctl commit (dc=all): 'db1143 (re)pooling @ 50%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25637 and previous config saved to /var/cache/conftool/dbconfig/20220420-111150-kormat.json [11:11:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:15] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host moss-be2001.codfw.wmnet [11:12:16] (03PS1) 10Vgutierrez: cache: disable backends_in_etcd for cloud instances [puppet] - 10https://gerrit.wikimedia.org/r/784657 [11:12:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:55] (03CR) 10Vgutierrez: [C: 03+2] cache: disable backends_in_etcd for cloud instances [puppet] - 10https://gerrit.wikimedia.org/r/784657 (owner: 10Vgutierrez) [11:13:19] !log kormat@cumin1001 dbctl commit (dc=all): 'es1031 (re)pooling @ 50%: repooling T303174', diff saved to https://phabricator.wikimedia.org/P25638 and previous config saved to /var/cache/conftool/dbconfig/20220420-111319-kormat.json [11:13:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', diff saved to 
https://phabricator.wikimedia.org/P25639 and previous config saved to /var/cache/conftool/dbconfig/20220420-111447-marostegui.json [11:14:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:58] (03PS2) 10Jgiannelos: maps: Disable replication and make postgres config on codfw/eqiad identical [puppet] - 10https://gerrit.wikimedia.org/r/784656 (https://phabricator.wikimedia.org/T306424) [11:16:26] !log kormat@cumin1001 dbctl commit (dc=all): 'db1156 (re)pooling @ 100%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25640 and previous config saved to /var/cache/conftool/dbconfig/20220420-111626-kormat.json [11:16:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:30] (03CR) 10Hnowlan: [C: 03+2] maps: Disable replication and make postgres config on codfw/eqiad identical [puppet] - 10https://gerrit.wikimedia.org/r/784656 (https://phabricator.wikimedia.org/T306424) (owner: 10Jgiannelos) [11:18:25] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host moss-be2001.codfw.wmnet [11:18:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:12] !log kormat@cumin1001 dbctl commit (dc=all): 'db1167 (re)pooling @ 100%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25641 and previous config saved to /var/cache/conftool/dbconfig/20220420-111911-kormat.json [11:19:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:24] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host moss-be2002.codfw.wmnet [11:19:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:28] !log kormat@cumin1001 dbctl commit (dc=all): 'db1165 (re)pooling @ 75%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25642 and previous config saved to /var/cache/conftool/dbconfig/20220420-111928-kormat.json [11:19:31] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:59] !log kormat@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 75%: repooling T303174', diff saved to https://phabricator.wikimedia.org/P25643 and previous config saved to /var/cache/conftool/dbconfig/20220420-111959-kormat.json [11:20:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:43] !log kormat@cumin1001 dbctl commit (dc=all): 'es1032 (re)pooling @ 75%: repooling T303174', diff saved to https://phabricator.wikimedia.org/P25644 and previous config saved to /var/cache/conftool/dbconfig/20220420-112043-kormat.json [11:20:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:31] 10SRE, 10Search-Console-access-request: Update Documentation and Process for Access to Search Consoles - https://phabricator.wikimedia.org/T303513 (10jcrespo) p:05Triage→03Medium [11:25:21] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host moss-be2002.codfw.wmnet [11:25:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:40] !log hnowlan@puppetmaster1001 conftool action : set/pooled=false; selector: dnsdisc=kartotherian,name=eqiad [11:25:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:37] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on 6 hosts with reason: postgres config change [11:26:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:42] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on 6 hosts with reason: postgres config change [11:26:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:56] !log kormat@cumin1001 dbctl commit (dc=all): 'db1143 (re)pooling @ 75%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25645 and previous config saved to 
/var/cache/conftool/dbconfig/20220420-112655-kormat.json [11:26:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:23] !log kormat@cumin1001 dbctl commit (dc=all): 'es1031 (re)pooling @ 75%: repooling T303174', diff saved to https://phabricator.wikimedia.org/P25646 and previous config saved to /var/cache/conftool/dbconfig/20220420-112823-kormat.json [11:28:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:49] 10SRE-OnFire (FY2021/2022-Q3): incidents occurring in Q3 have been scored with the scorecard - https://phabricator.wikimedia.org/T299977 (10lmata) [11:29:18] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host moss-fe2001.codfw.wmnet [11:29:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311 (T306269)', diff saved to https://phabricator.wikimedia.org/P25647 and previous config saved to /var/cache/conftool/dbconfig/20220420-112952-marostegui.json [11:29:54] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1119.eqiad.wmnet with reason: Maintenance [11:29:56] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1119.eqiad.wmnet with reason: Maintenance [11:29:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:57] T306269: Make primary key ipblocks.ipb_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T306269 [11:30:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1119 (T306269)', diff saved to https://phabricator.wikimedia.org/P25648 and previous config saved to /var/cache/conftool/dbconfig/20220420-113000-marostegui.json [11:30:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:07] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:03] 10SRE, 10SRE-OnFire, 10Observability-Metrics: write up impact estimation procedure - https://phabricator.wikimedia.org/T246739 (10lmata) [11:32:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119 (T306269)', diff saved to https://phabricator.wikimedia.org/P25649 and previous config saved to /var/cache/conftool/dbconfig/20220420-113219-marostegui.json [11:32:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:40] 10SRE-OnFire (FY2021/2022-Q3): incidents occurring in Q3 have been scored with the scorecard - https://phabricator.wikimedia.org/T299977 (10lmata) [11:33:56] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host moss-fe2001.codfw.wmnet [11:33:57] 10SRE-OnFire (FY2021/2022-Q2): incidents occurring during Q2 have been scored with the scorecard - https://phabricator.wikimedia.org/T292254 (10lmata) [11:33:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:24] !log hnowlan@puppetmaster1001 conftool action : set/pooled=true; selector: dnsdisc=kartotherian,name=eqiad [11:34:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:32] !log kormat@cumin1001 dbctl commit (dc=all): 'db1165 (re)pooling @ 100%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25650 and previous config saved to /var/cache/conftool/dbconfig/20220420-113432-kormat.json [11:34:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:03] !log kormat@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 100%: repooling T303174', diff saved to https://phabricator.wikimedia.org/P25651 and previous config saved to /var/cache/conftool/dbconfig/20220420-113503-kormat.json [11:35:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:47] !log kormat@cumin1001 dbctl commit (dc=all): 
'es1032 (re)pooling @ 100%: repooling T303174', diff saved to https://phabricator.wikimedia.org/P25652 and previous config saved to /var/cache/conftool/dbconfig/20220420-113547-kormat.json [11:35:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:59] !log kormat@cumin1001 dbctl commit (dc=all): 'db1143 (re)pooling @ 100%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25653 and previous config saved to /var/cache/conftool/dbconfig/20220420-114159-kormat.json [11:42:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:45] 10SRE-OnFire (FY2021/2022-Q4): incidents occurring during Q4 have been scored with the scorecard - https://phabricator.wikimedia.org/T306511 (10lmata) [11:43:27] !log kormat@cumin1001 dbctl commit (dc=all): 'es1031 (re)pooling @ 100%: repooling T303174', diff saved to https://phabricator.wikimedia.org/P25654 and previous config saved to /var/cache/conftool/dbconfig/20220420-114326-kormat.json [11:43:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119', diff saved to https://phabricator.wikimedia.org/P25655 and previous config saved to /var/cache/conftool/dbconfig/20220420-114727-marostegui.json [11:47:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:51:57] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host moss-fe2002.codfw.wmnet [11:52:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:53] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/784272 (owner: 10Ssingh) [11:57:00] RECOVERY - SSH on wtp1037.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:57:38] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) 
for host moss-fe2002.codfw.wmnet [11:57:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:46] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host moss-fe1001.eqiad.wmnet [11:59:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:49] (03PS1) 10Hnowlan: Revert "tegola: bump memory and CPU limit" [deployment-charts] - 10https://gerrit.wikimedia.org/r/783923 [12:00:23] (03CR) 10Jgiannelos: [C: 03+1] Revert "tegola: bump memory and CPU limit" [deployment-charts] - 10https://gerrit.wikimedia.org/r/783923 (owner: 10Hnowlan) [12:01:37] (03PS1) 10Hnowlan: tegola: increase number of replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/784660 (https://phabricator.wikimedia.org/T306424) [12:02:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119', diff saved to https://phabricator.wikimedia.org/P25656 and previous config saved to /var/cache/conftool/dbconfig/20220420-120232-marostegui.json [12:02:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:28] 10SRE-swift-storage, 10Data-Persistence, 10Maps, 10Product-Infrastructure-Team-Backlog, 10Patch-For-Review: Tegola pods are crashing because swift doesnt allow connections - https://phabricator.wikimedia.org/T306424 (10fgiunchedi) >>! In T306424#7867426, @hnowlan wrote: >>>! In T306424#7867246, @Jgiannel... 
[12:04:36] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host moss-fe1001.eqiad.wmnet [12:04:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:52] (03CR) 10Hnowlan: [C: 03+2] Revert "tegola: bump memory and CPU limit" [deployment-charts] - 10https://gerrit.wikimedia.org/r/783923 (owner: 10Hnowlan) [12:07:23] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host moss-fe1002.eqiad.wmnet [12:07:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:53] (03CR) 10Filippo Giunchedi: "Safe to merge once I186c872eb5 has accumulated some data" [puppet] - 10https://gerrit.wikimedia.org/r/784636 (https://phabricator.wikimedia.org/T288726) (owner: 10Filippo Giunchedi) [12:09:53] (03Merged) 10jenkins-bot: Revert "tegola: bump memory and CPU limit" [deployment-charts] - 10https://gerrit.wikimedia.org/r/783923 (owner: 10Hnowlan) [12:10:32] (03CR) 10Phedenskog: grafana: provision JSON datasource (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/774380 (https://phabricator.wikimedia.org/T304583) (owner: 10Phedenskog) [12:13:12] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host moss-fe1002.eqiad.wmnet [12:13:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:10] (03PS1) 10Kevin Bazira: ml-services: update translatewiki predictor image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/784662 (https://phabricator.wikimedia.org/T306501) [12:16:23] (03CR) 10Elukey: ml-services: update translatewiki predictor image version (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/784662 (https://phabricator.wikimedia.org/T306501) (owner: 10Kevin Bazira) [12:17:13] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host puppetmaster1005.eqiad.wmnet [12:17:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:38] !log 
marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119 (T306269)', diff saved to https://phabricator.wikimedia.org/P25657 and previous config saved to /var/cache/conftool/dbconfig/20220420-121737-marostegui.json [12:17:39] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1134.eqiad.wmnet with reason: Maintenance [12:17:41] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1134.eqiad.wmnet with reason: Maintenance [12:17:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:43] T306269: Make primary key ipblocks.ipb_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T306269 [12:17:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1134 (T306269)', diff saved to https://phabricator.wikimedia.org/P25658 and previous config saved to /var/cache/conftool/dbconfig/20220420-121745-marostegui.json [12:17:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:15] (03PS2) 10Kevin Bazira: ml-services: update translatewiki predictor image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/784662 (https://phabricator.wikimedia.org/T306501) [12:20:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T306269)', diff saved to https://phabricator.wikimedia.org/P25659 and previous config saved to /var/cache/conftool/dbconfig/20220420-122000-marostegui.json [12:20:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:16] (03CR) 10Kevin Bazira: ml-services: update translatewiki predictor image version (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/784662 (https://phabricator.wikimedia.org/T306501) (owner: 10Kevin Bazira) 
[12:20:38] PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test [12:21:49] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetmaster1005.eqiad.wmnet [12:21:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:05] (03CR) 10Elukey: [C: 03+2] ml-services: update translatewiki predictor image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/784662 (https://phabricator.wikimedia.org/T306501) (owner: 10Kevin Bazira) [12:32:05] PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test [12:32:32] (03CR) 10David Caro: DONOTMERGE: skeleteon for the replicaconfig service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/780853 (owner: 10David Caro) [12:33:29] !log kevinbazira@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . [12:33:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:52] !log kevinbazira@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . 
[12:33:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:00] !log reboot conf2004, conf1004 [12:36:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:08] PROBLEM - Host conf1004 is DOWN: PING CRITICAL - Packet loss = 100% [12:37:28] PROBLEM - Host conf2004 is DOWN: PING CRITICAL - Packet loss = 100% [12:38:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P25660 and previous config saved to /var/cache/conftool/dbconfig/20220420-123807-marostegui.json [12:38:08] RECOVERY - Host conf1004 is UP: PING OK - Packet loss = 0%, RTA = 0.18 ms [12:38:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:16] RECOVERY - Host conf2004 is UP: PING OK - Packet loss = 0%, RTA = 31.61 ms [12:38:48] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [12:39:57] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db1147.eqiad.wmnet with reason: Rebooting for T303174 [12:39:59] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db1147.eqiad.wmnet with reason: Rebooting for T303174 [12:40:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:04] !log kormat@cumin1001 dbctl commit (dc=all): 'db1147 depooling: Rebooting for T303174', diff saved to https://phabricator.wikimedia.org/P25661 and previous config saved to 
/var/cache/conftool/dbconfig/20220420-124004-kormat.json [12:40:07] !log installing webperf1003 T305460 [12:40:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:12] T305460: Upgrade webperf hosts to Bullseye - https://phabricator.wikimedia.org/T305460 [12:40:35] api_appserver did have a latency spike but looks like it is subsiding [12:40:47] so did appserver cluster [12:40:57] oh, parsoid too [12:41:08] RECOVERY - High average GET latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [12:41:44] probably related to conf1004 being rebooted [12:43:32] PROBLEM - PyBal connections to etcd on lvs1017 is CRITICAL: CRITICAL: 8 connections established with conf1004.eqiad.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [12:44:24] PROBLEM - PyBal connections to etcd on lvs1018 is CRITICAL: CRITICAL: 23 connections established with conf1004.eqiad.wmnet:4001 (min=36) https://wikitech.wikimedia.org/wiki/PyBal [12:44:42] (03CR) 10Ladsgroup: "One tiny thing and that has my blessing." 
[puppet] - 10https://gerrit.wikimedia.org/r/784583 (https://phabricator.wikimedia.org/T254738) (owner: 10Marostegui) [12:44:49] !log kormat@cumin1001 dbctl commit (dc=all): 'db1147 (re)pooling @ 25%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25662 and previous config saved to /var/cache/conftool/dbconfig/20220420-124448-kormat.json [12:44:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:56] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db1172.eqiad.wmnet with reason: Rebooting for T303174 [12:44:57] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db1172.eqiad.wmnet with reason: Rebooting for T303174 [12:44:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:03] !log kormat@cumin1001 dbctl commit (dc=all): 'db1172 depooling: Rebooting for T303174', diff saved to https://phabricator.wikimedia.org/P25663 and previous config saved to /var/cache/conftool/dbconfig/20220420-124502-kormat.json [12:45:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:30] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on es1032.eqiad.wmnet with reason: Rebooting for T303174 [12:45:32] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on es1032.eqiad.wmnet with reason: Rebooting for T303174 [12:45:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:37] !log kormat@cumin1001 dbctl commit (dc=all): 'es1032 depooling: Rebooting for T303174', diff saved to https://phabricator.wikimedia.org/P25664 and previous config saved to /var/cache/conftool/dbconfig/20220420-124537-kormat.json [12:45:39] (03PS6) 10Marostegui: 
monitor_eventscheduler.pp: Monitor event_scheduler on tests hosts [puppet] - 10https://gerrit.wikimedia.org/r/784583 (https://phabricator.wikimedia.org/T254738) [12:45:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:52] PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test [12:45:56] (03PS1) 10Vgutierrez: cache: Fix varnish-frontend puppetization on non-etcd environments [puppet] - 10https://gerrit.wikimedia.org/r/784666 [12:47:12] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34913/console" [puppet] - 10https://gerrit.wikimedia.org/r/784666 (owner: 10Vgutierrez) [12:48:10] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for Samtar (TheresNoTime) - https://phabricator.wikimedia.org/T306518 (10TheresNoTime) [12:49:20] !log kormat@cumin1001 dbctl commit (dc=all): 'db1172 (re)pooling @ 25%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25665 and previous config saved to /var/cache/conftool/dbconfig/20220420-124920-kormat.json [12:49:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:29] !log kormat@cumin1001 dbctl commit (dc=all): 'es1032 (re)pooling @ 25%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25666 and previous config saved to /var/cache/conftool/dbconfig/20220420-124926-kormat.json [12:49:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:53] (03PS2) 10Alexandros Kosiaris: helmfile.d: Remove all reference to tillerNamespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/784227 (https://phabricator.wikimedia.org/T251305) [12:50:12] PROBLEM - Host conf2005 is DOWN: PING CRITICAL - Packet loss = 100% [12:51:30] RECOVERY - 
Host conf2005 is UP: PING OK - Packet loss = 0%, RTA = 33.31 ms [12:52:36] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [12:52:58] (03CR) 10Ladsgroup: monitor_eventscheduler.pp: Monitor event_scheduler on tests hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/784583 (https://phabricator.wikimedia.org/T254738) (owner: 10Marostegui) [12:53:03] (03CR) 10Ladsgroup: [C: 03+1] monitor_eventscheduler.pp: Monitor event_scheduler on tests hosts [puppet] - 10https://gerrit.wikimedia.org/r/784583 (https://phabricator.wikimedia.org/T254738) (owner: 10Marostegui) [12:53:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P25667 and previous config saved to /var/cache/conftool/dbconfig/20220420-125312-marostegui.json [12:53:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:54] RECOVERY - High average GET latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [12:57:22] PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test [12:58:02] !log reboot conf2006, conf1006 [12:58:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:00] (03CR) 10Alexandros Kosiaris: [C: 03+2] helmfile.d: Remove all reference to tillerNamespace (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/784227 (https://phabricator.wikimedia.org/T251305) (owner: 10Alexandros Kosiaris) [12:59:02] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db1168.eqiad.wmnet with reason: Rebooting for T303174 [12:59:04] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db1168.eqiad.wmnet with reason: Rebooting for T303174 [12:59:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:09] !log kormat@cumin1001 dbctl commit (dc=all): 'db1168 depooling: Rebooting for T303174', diff saved to https://phabricator.wikimedia.org/P25668 and previous config saved to /var/cache/conftool/dbconfig/20220420-125909-kormat.json [12:59:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:53] !log kormat@cumin1001 dbctl commit (dc=all): 'db1147 (re)pooling @ 50%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25669 and previous config saved to /var/cache/conftool/dbconfig/20220420-125952-kormat.json [12:59:56] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:05] RoanKattouw, Lucas_WMDE, and Urbanecm: #bothumor I � Unicode. All rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220420T1300). [13:00:05] subbu and arlolra: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:40] PROBLEM - Host conf1006 is DOWN: PING CRITICAL - Packet loss = 100% [13:01:31] o/ [13:01:46] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [13:02:44] (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [13:03:07] (03Merged) 10jenkins-bot: helmfile.d: Remove all reference to tillerNamespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/784227 (https://phabricator.wikimedia.org/T251305) (owner: 10Alexandros Kosiaris) [13:03:24] PROBLEM - Check systemd state on conf2005 is CRITICAL: CRITICAL - degraded: The following units failed: etcdmirror-conftool-eqiad-wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:03:45] (JobUnavailable) firing: Reduced availability for job etcdmirror in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:03:51] PROBLEM - Etcd replication lag #page on conf2005 is CRITICAL: connect to address 10.192.32.52 and port 8000: Connection refused https://wikitech.wikimedia.org/wiki/Etcd [13:04:00] o/ I can deploy if no-one else is around [13:04:04] RECOVERY - High average GET latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [13:04:13] * volans here [13:04:18] here [13:04:18] * Emperor here [13:04:24] PROBLEM - etcdmirror-conftool-eqiad-wmnet service on conf2005 is CRITICAL: CRITICAL - Expecting active but unit etcdmirror-conftool-eqiad-wmnet is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:04:38] PROBLEM - PyBal connections to etcd on lvs3005 is CRITICAL: CRITICAL: 0 connections established with conf1006.eqiad.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [13:04:41] akosiaris: related to the restart !logs above? [13:04:44] PROBLEM - PyBal connections to etcd on lvs6002 is CRITICAL: CRITICAL: 0 connections established with conf1006.eqiad.wmnet:4001 (min=4) https://wikitech.wikimedia.org/wiki/PyBal [13:04:45] dbctl is broken due to etcd issues [13:04:52] did conf host going down caused mw problems [13:05:09] or was something else causing issues on both? [13:05:09] yes, related to the restart [13:05:16] RECOVERY - Host conf1006 is UP: PING OK - Packet loss = 0%, RTA = 0.17 ms [13:05:16] PROBLEM - PyBal connections to etcd on lvs6001 is CRITICAL: CRITICAL: 0 connections established with conf1006.eqiad.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [13:05:23] ah, restart == trigger issue? 
[13:05:27] around [13:05:28] here as well [13:05:33] but do I need to be? [13:05:41] etcd-mirror couldn't connect on conf2005 fwiw [13:05:49] should be fixed in a couple of secs [13:05:50] 10SRE-swift-storage, 10Data-Persistence, 10Maps, 10Product-Infrastructure-Team-Backlog, 10Patch-For-Review: Tegola pods are crashing because swift doesnt allow connections - https://phabricator.wikimedia.org/T306424 (10Jgiannelos) Sounds good [13:05:55] I see cluster-healthy on both sides now [13:05:58] should we delay the mediawiki backport/config window? [13:05:58] (03PS1) 10Kevin Bazira: ml-services: update editquality predictor image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/784668 (https://phabricator.wikimedia.org/T306501) [13:06:05] pybal already recovered on lvs6001 [13:06:14] icinga should follow soon :) [13:06:24] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db1127.eqiad.wmnet with reason: Rebooting for T303174 [13:06:26] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db1127.eqiad.wmnet with reason: Rebooting for T303174 [13:06:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:11] * taavi waiting before deploying anything until instructed otherwise [13:07:29] I am going to check if there was user impact [13:07:55] akosiaris: should we restart etcdmirror? [13:08:04] I was about to ask the same :D [13:08:10] (03PS1) 10Ottomata: Add cparel and mfossati to analytics-platform-eng-admins [puppet] - 10https://gerrit.wikimedia.org/r/784669 (https://phabricator.wikimedia.org/T306057) [13:08:13] jynus: AIUI I would expect not - per https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster stuff should keep going while etcd is away [13:08:23] subbu: in the meantime: why are those changes being backported? 
[13:08:23] otherwise we'd need to wait for puppet to bring it up, its unit is failed [13:08:24] * Emperor reserves the right to be wrong [13:08:34] Emperor: I wouldn't expect it, but I want to make sure theory == reality [13:08:59] !log kormat@cumin1001 dbctl commit (dc=all): 'db1127 depooling: Rebooting for T303174', diff saved to https://phabricator.wikimedia.org/P25670 and previous config saved to /var/cache/conftool/dbconfig/20220420-130859-kormat.json [13:09:01] we were late updating vendor with the parsoid changes .. so, we were hoping to get those rolled out in this train. [13:09:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:04] Emperor: at the very least, e.g. dbctl gets impacted [13:09:06] !log kormat@cumin1001 dbctl commit (dc=all): 'db1172 (re)pooling @ 25%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25671 and previous config saved to /var/cache/conftool/dbconfig/20220420-130905-kormat.json [13:09:08] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1135.eqiad.wmnet with reason: Maintenance [13:09:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:10] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1135.eqiad.wmnet with reason: Maintenance [13:09:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1135 (T306269)', diff saved to https://phabricator.wikimedia.org/P25672 and previous config saved to /var/cache/conftool/dbconfig/20220420-130914-marostegui.json [13:09:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:17] taavi, but, they are definitely not critical and can be delayed to next week if it is a concern. 
[13:09:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:20] T306269: Make primary key ipblocks.ipb_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T306269 [13:09:44] elukey: yeah, makes sense [13:10:00] vgutierrez: re lvs/cache, I see no impact, agree? [13:10:14] RECOVERY - PyBal connections to etcd on lvs3005 is OK: OK: 12 connections established with conf1006.eqiad.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [13:10:14] RECOVERY - PyBal connections to etcd on lvs6002 is OK: OK: 4 connections established with conf1006.eqiad.wmnet:4001 (min=4) https://wikitech.wikimedia.org/wiki/PyBal [13:10:14] RECOVERY - PyBal connections to etcd on lvs6001 is OK: OK: 12 connections established with conf1006.eqiad.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [13:10:22] jynus: indeed [13:10:25] akosiaris: ack doing it [13:10:31] !log restart etcdmirror on conf2005 [13:10:34] checking mw servers now [13:10:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:38] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db1168.eqiad.wmnet with reason: Rebooting for T303174 [13:10:40] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db1168.eqiad.wmnet with reason: Rebooting for T303174 [13:10:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:12] lvs1017 and lvs1018 require a pybal restart, proceeding [13:11:12] RECOVERY - etcdmirror-conftool-eqiad-wmnet service on conf2005 is OK: OK - etcdmirror-conftool-eqiad-wmnet is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:11:40] I think there are latency spikes on the app server side until etcd times out [13:11:54] well, until the app server times out connecting to etcd [13:11:57] !log restarting pybal on lvs1018 [13:12:00]
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:23] !log kormat@cumin1001 dbctl commit (dc=all): 'Change es2 'master' to es1026 T303174', diff saved to https://phabricator.wikimedia.org/P25673 and previous config saved to /var/cache/conftool/dbconfig/20220420-131222-kormat.json [13:12:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T306269)', diff saved to https://phabricator.wikimedia.org/P25674 and previous config saved to /var/cache/conftool/dbconfig/20220420-131228-marostegui.json [13:12:32] RECOVERY - Check systemd state on conf2005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:12:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:41] jynus: for what athe appservers should connect to etcd on a live request? [13:12:44] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on es1030.eqiad.wmnet with reason: Rebooting for T303174 [13:12:46] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on es1030.eqiad.wmnet with reason: Rebooting for T303174 [13:12:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:49] sorry about that, I left out conf1005 on purpose to avoid this [13:12:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:51] !log kormat@cumin1001 dbctl commit (dc=all): 'es1030 depooling: Rebooting for T303174', diff saved to https://phabricator.wikimedia.org/P25675 and previous config saved to /var/cache/conftool/dbconfig/20220420-131251-kormat.json [13:12:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:57] RECOVERY - Etcd replication lag #page on conf2005 is OK: HTTP OK: HTTP/1.1 200 OK - 148 bytes in 0.068 second response time 
https://wikitech.wikimedia.org/wiki/Etcd [13:13:16] volans: cannot say, but see: https://phabricator.wikimedia.org/T299977 [13:13:45] (JobUnavailable) resolved: Reduced availability for job etcdmirror in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:13:56] subbu: just that I haven't backported a vendor change before and I'm not fully confident in my abilities to do it without taking all of group1 down, so I'd prefer not to do that unless necessary [13:14:02] jynus: wrong link? I don't see etcd there at all :D [13:14:04] s/group1/group0/ [13:14:14] RECOVERY - PyBal connections to etcd on lvs1018 is OK: OK: 36 connections established with conf1004.eqiad.wmnet:4001 (min=36) https://wikitech.wikimedia.org/wiki/PyBal [13:14:23] !log restarting pybal on lvs1017 [13:14:24] volans: etcd was down at the time [13:14:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:41] but the high latency alert happened at the same time as etcd being down [13:14:45] !log kormat@cumin1001 dbctl commit (dc=all): 'db1168 (re)pooling @ 25%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25676 and previous config saved to /var/cache/conftool/dbconfig/20220420-131444-kormat.json [13:14:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:54] taavi, ok, let's skip it then. [13:14:57] !log kormat@cumin1001 dbctl commit (dc=all): 'db1147 (re)pooling @ 75%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25677 and previous config saved to /var/cache/conftool/dbconfig/20220420-131456-kormat.json [13:15:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:01] maybe it is something else? [13:15:18] subbu: ack!
sorry about that :/ [13:15:45] PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test [13:15:58] interestingly, s8 traffic was the most impacted [13:16:32] RECOVERY - PyBal connections to etcd on lvs1017 is OK: OK: 12 connections established with conf1004.eqiad.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [13:16:51] jynus: IIUC the latency spike was only for codfw appservers, not eqiad [13:16:52] volans: just to be clear, I don't know those are related- but if there was a production issue, that is the main concern [13:16:57] taavi, no worries. thanks. [13:17:15] elukey: ah, true [13:17:36] that is a little weird but there is probably an explanation with etcd, if we check logs [13:17:58] so then it seems like no impact on production users [13:18:51] maybe it shows big changes because the etcd uncached requests is a significant amount of request, elukey? 
as it has no real user traffic [13:20:21] I think we can just move on anyway [13:22:35] yeah I am trying to check some proof but didn't find much [13:23:18] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on es1031.eqiad.wmnet with reason: Rebooting for T303174 [13:23:20] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on es1031.eqiad.wmnet with reason: Rebooting for T303174 [13:23:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:25] !log kormat@cumin1001 dbctl commit (dc=all): 'es1031 depooling: Rebooting for T303174', diff saved to https://phabricator.wikimedia.org/P25678 and previous config saved to /var/cache/conftool/dbconfig/20220420-132325-kormat.json [13:23:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:41] it could be just other unrelated maintenance- there is always something ongoing on codfw and it is "misstreated" when depooled from traffic [13:24:09] !log kormat@cumin1001 dbctl commit (dc=all): 'db1172 (re)pooling @ 50%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25679 and previous config saved to /var/cache/conftool/dbconfig/20220420-132409-kormat.json [13:24:10] !log kormat@cumin1001 dbctl commit (dc=all): 'es1032 (re)pooling @ 50%: repooling T303174', diff saved to https://phabricator.wikimedia.org/P25680 and previous config saved to /var/cache/conftool/dbconfig/20220420-132410-kormat.json [13:24:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P25681 and previous config saved to /var/cache/conftool/dbconfig/20220420-132733-marostegui.json 
[13:27:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:39] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] cache: Fix varnish-frontend puppetization on non-etcd environments [puppet] - 10https://gerrit.wikimedia.org/r/784666 (owner: 10Vgutierrez) [13:29:39] (03PS1) 10Majavah: openstack: fix not found responses [puppet] - 10https://gerrit.wikimedia.org/r/784674 [13:29:49] !log kormat@cumin1001 dbctl commit (dc=all): 'db1168 (re)pooling @ 50%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25682 and previous config saved to /var/cache/conftool/dbconfig/20220420-132948-kormat.json [13:29:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:01] !log kormat@cumin1001 dbctl commit (dc=all): 'db1147 (re)pooling @ 100%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25683 and previous config saved to /var/cache/conftool/dbconfig/20220420-133000-kormat.json [13:30:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:45] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db1127.eqiad.wmnet with reason: Rebooting for T303174 [13:30:47] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db1127.eqiad.wmnet with reason: Rebooting for T303174 [13:30:47] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on es1030.eqiad.wmnet with reason: Rebooting for T303174 [13:30:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:49] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on es1030.eqiad.wmnet with reason: Rebooting for T303174 [13:30:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:33] (03CR) 
10Andrew Bogott: [C: 03+2] openstack: fix not found responses [puppet] - 10https://gerrit.wikimedia.org/r/784674 (owner: 10Majavah) [13:31:55] vgutierrez: quick q: would pybal be ok eventually if you hadn't restarted it on lvs1017 and lvs1018? [13:33:06] eventually :) [13:33:10] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db1148.eqiad.wmnet with reason: Rebooting for T303174 [13:33:12] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db1148.eqiad.wmnet with reason: Rebooting for T303174 [13:33:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:17] !log kormat@cumin1001 dbctl commit (dc=all): 'db1148 depooling: Rebooting for T303174', diff saved to https://phabricator.wikimedia.org/P25684 and previous config saved to /var/cache/conftool/dbconfig/20220420-133317-kormat.json [13:33:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:40] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on es1021.eqiad.wmnet with reason: Rebooting for T303174 [13:35:40] (03PS1) 10Majavah: openstack: wmf_sink: fix response variable name [puppet] - 10https://gerrit.wikimedia.org/r/784677 [13:35:41] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on es1021.eqiad.wmnet with reason: Rebooting for T303174 [13:35:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:47] !log kormat@cumin1001 dbctl commit (dc=all): 'es1021 depooling: Rebooting for T303174', diff saved to https://phabricator.wikimedia.org/P25685 and previous config saved to /var/cache/conftool/dbconfig/20220420-133546-kormat.json [13:35:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:15] !log 
kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on es1024.eqiad.wmnet with reason: Rebooting for T303174 [13:36:17] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on es1024.eqiad.wmnet with reason: Rebooting for T303174 [13:36:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:19] (03CR) 10Andrew Bogott: [C: 03+2] openstack: wmf_sink: fix response variable name [puppet] - 10https://gerrit.wikimedia.org/r/784677 (owner: 10Majavah) [13:36:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:22] !log kormat@cumin1001 dbctl commit (dc=all): 'es1024 depooling: Rebooting for T303174', diff saved to https://phabricator.wikimedia.org/P25686 and previous config saved to /var/cache/conftool/dbconfig/20220420-133622-kormat.json [13:36:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:59] !log kormat@cumin1001 dbctl commit (dc=all): 'db1148 (re)pooling @ 25%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25687 and previous config saved to /var/cache/conftool/dbconfig/20220420-133757-kormat.json [13:38:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:13] !log kormat@cumin1001 dbctl commit (dc=all): 'db1172 (re)pooling @ 75%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25688 and previous config saved to /var/cache/conftool/dbconfig/20220420-133913-kormat.json [13:39:14] !log kormat@cumin1001 dbctl commit (dc=all): 'es1032 (re)pooling @ 75%: repooling T303174', diff saved to https://phabricator.wikimedia.org/P25689 and previous config saved to /var/cache/conftool/dbconfig/20220420-133914-kormat.json [13:39:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:41] (03CR) 10Ottomata: [V: 03+1] "PCC SUCCESS (DIFF 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34914/console" [puppet] - 10https://gerrit.wikimedia.org/r/784669 (https://phabricator.wikimedia.org/T306057) (owner: 10Ottomata) [13:40:53] (03PS2) 10Hnowlan: tegola: increase number of replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/784660 (https://phabricator.wikimedia.org/T306424) [13:42:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P25690 and previous config saved to /var/cache/conftool/dbconfig/20220420-134238-marostegui.json [13:42:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:53] !log kormat@cumin1001 dbctl commit (dc=all): 'db1168 (re)pooling @ 75%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25691 and previous config saved to /var/cache/conftool/dbconfig/20220420-134452-kormat.json [13:44:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:25] jouncebot: nowandnext [13:45:25] For the next 0 hour(s) and 14 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220420T1300) [13:45:25] In 4 hour(s) and 14 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220420T1800) [13:45:25] In 4 hour(s) and 14 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220420T1800) [13:45:45] taavi: are you done? Can I deploy the test commons thingy? 
[13:45:59] Amir1: sure [13:46:11] (03PS4) 10Ladsgroup: filebackend: Fix link to thumb url in testcommonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/784311 (https://phabricator.wikimedia.org/T306139) [13:46:14] (03CR) 10Ladsgroup: [C: 03+2] filebackend: Fix link to thumb url in testcommonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/784311 (https://phabricator.wikimedia.org/T306139) (owner: 10Ladsgroup) [13:46:58] (03Merged) 10jenkins-bot: filebackend: Fix link to thumb url in testcommonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/784311 (https://phabricator.wikimedia.org/T306139) (owner: 10Ladsgroup) [13:47:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [13:48:32] !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:784311|filebackend: Fix link to thumb url in testcommonswiki (T306139)]] (duration: 00m 53s) [13:48:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:37] T306139: Test Commons doesn't show any images - https://phabricator.wikimedia.org/T306139 [13:50:18] 10SRE-swift-storage, 10Patch-For-Review, 10User-Ladsgroup: Test Commons doesn't show any images - https://phabricator.wikimedia.org/T306139 (10Ladsgroup) 05Open→03Resolved a:03Ladsgroup [13:51:41] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:51:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:51:45] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:51:47] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:49] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:51:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:53] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on es1021.eqiad.wmnet with reason: Rebooting for T303174 [13:52:55] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on es1021.eqiad.wmnet with reason: Rebooting for T303174 [13:52:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:59] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db1127.eqiad.wmnet with reason: Rebooting for T303174 [13:52:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:00] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db1127.eqiad.wmnet with reason: Rebooting for T303174 [13:53:02] PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 14 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test [13:53:02] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on es1030.eqiad.wmnet with reason: Rebooting for T303174 [13:53:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:03] !log kormat@cumin1001 dbctl commit (dc=all): 'db1148 (re)pooling @ 50%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25692 and previous config saved to /var/cache/conftool/dbconfig/20220420-135302-kormat.json [13:53:04] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on es1030.eqiad.wmnet with reason: Rebooting for T303174 [13:53:04] !log 
kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on es1031.eqiad.wmnet with reason: Rebooting for T303174 [13:53:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:06] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on es1031.eqiad.wmnet with reason: Rebooting for T303174 [13:53:09] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on es1024.eqiad.wmnet with reason: Rebooting for T303174 [13:53:10] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on es1024.eqiad.wmnet with reason: Rebooting for T303174 [13:53:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:56] (03CR) 10Hnowlan: [C: 03+2] tegola: increase number of replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/784660 (https://phabricator.wikimedia.org/T306424) (owner: 10Hnowlan) [13:54:00] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for Samtar (TheresNoTime) - https://phabricator.wikimedia.org/T306518 (10Volans) For contractors we usually grant the `ldap/nda` group instead, at the practical level they are almost equivalent, so that should work too. @TheresNoTime Would be ok for you t... 
[13:54:17] !log kormat@cumin1001 dbctl commit (dc=all): 'db1172 (re)pooling @ 100%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25693 and previous config saved to /var/cache/conftool/dbconfig/20220420-135417-kormat.json [13:54:18] !log kormat@cumin1001 dbctl commit (dc=all): 'es1032 (re)pooling @ 100%: repooling T303174', diff saved to https://phabricator.wikimedia.org/P25694 and previous config saved to /var/cache/conftool/dbconfig/20220420-135417-kormat.json [13:54:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:23] !log kormat@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 25%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25695 and previous config saved to /var/cache/conftool/dbconfig/20220420-135623-kormat.json [13:56:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:49] !log kormat@cumin1001 dbctl commit (dc=all): 'es1030 (re)pooling @ 25%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25696 and previous config saved to /var/cache/conftool/dbconfig/20220420-135648-kormat.json [13:56:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:32] RECOVERY - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is OK: 1 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test [13:57:41] !log kormat@cumin1001 dbctl commit (dc=all): 'es1021 (re)pooling @ 25%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25697 and previous config saved to /var/cache/conftool/dbconfig/20220420-135740-kormat.json [13:57:43] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for Samtar (TheresNoTime) - https://phabricator.wikimedia.org/T306518 (10TheresNoTime) >>! 
In T306518#7868047, @Volans wrote: > For contractors we usually grant the `ldap/nda` group instead, at the practical level they are almost equivalent, so that shoul... [13:57:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:47] (PS1) Ladsgroup: wmnet: Update s8-master CNAME [dns] - https://gerrit.wikimedia.org/r/784678 (https://phabricator.wikimedia.org/T303927) [13:57:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T306269)', diff saved to https://phabricator.wikimedia.org/P25698 and previous config saved to /var/cache/conftool/dbconfig/20220420-135750-marostegui.json [13:57:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:55] T306269: Make primary key ipblocks.ipb_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T306269 [13:58:23] (Merged) jenkins-bot: tegola: increase number of replicas [deployment-charts] - https://gerrit.wikimedia.org/r/784660 (https://phabricator.wikimedia.org/T306424) (owner: Hnowlan) [13:58:41] !log kormat@cumin1001 dbctl commit (dc=all): 'es1024 (re)pooling @ 25%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25699 and previous config saved to /var/cache/conftool/dbconfig/20220420-135841-kormat.json [13:58:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:18] SRE, LDAP-Access-Requests: Grant Access to nda for jmads - https://phabricator.wikimedia.org/T306117 (Volans) Open→Resolved a:Volans Granted `ldap/nda` group, confirmation of NDA on file is in T249873#7865953. Resolving. 
[13:59:58] !log kormat@cumin1001 dbctl commit (dc=all): 'db1168 (re)pooling @ 100%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25700 and previous config saved to /var/cache/conftool/dbconfig/20220420-135956-kormat.json [14:00:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:22] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db1177.eqiad.wmnet with reason: Rebooting for T303174 [14:00:24] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db1177.eqiad.wmnet with reason: Rebooting for T303174 [14:00:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:29] !log kormat@cumin1001 dbctl commit (dc=all): 'db1177 depooling: Rebooting for T303174', diff saved to https://phabricator.wikimedia.org/P25701 and previous config saved to /var/cache/conftool/dbconfig/20220420-140029-kormat.json [14:00:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:54] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1161.eqiad.wmnet with reason: Maintenance [14:00:56] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1161.eqiad.wmnet with reason: Maintenance [14:00:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:57] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [14:01:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:01] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [14:01:03] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1161 (T298565)', diff saved to https://phabricator.wikimedia.org/P25702 and previous config saved to /var/cache/conftool/dbconfig/20220420-140105-ladsgroup.json [14:01:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:08] PROBLEM - MediaWiki EtcdConfig up-to-date on mw1322 is CRITICAL: etcd last index (474291) is outdated compared to the master one (474294) https://wikitech.wikimedia.org/wiki/Etcd [14:01:08] PROBLEM - MediaWiki EtcdConfig up-to-date on mw1331 is CRITICAL: etcd last index (474291) is outdated compared to the master one (474294) https://wikitech.wikimedia.org/wiki/Etcd [14:01:08] PROBLEM - MediaWiki EtcdConfig up-to-date on mw1381 is CRITICAL: etcd last index (474291) is outdated compared to the master one (474294) https://wikitech.wikimedia.org/wiki/Etcd [14:01:10] PROBLEM - MediaWiki EtcdConfig up-to-date on mw2295 is CRITICAL: etcd last index (178453) is outdated compared to the master one (178459) https://wikitech.wikimedia.org/wiki/Etcd [14:01:10] PROBLEM - MediaWiki EtcdConfig up-to-date on mw2283 is CRITICAL: etcd last index (178453) is outdated compared to the master one (178459) https://wikitech.wikimedia.org/wiki/Etcd [14:01:10] PROBLEM - MediaWiki EtcdConfig up-to-date on mw2322 is CRITICAL: etcd last index (178453) is outdated compared to the master one (178459) https://wikitech.wikimedia.org/wiki/Etcd [14:01:10] PROBLEM - MediaWiki EtcdConfig up-to-date on mw2338 is CRITICAL: etcd last index (178453) is outdated compared to the master one (178459) https://wikitech.wikimedia.org/wiki/Etcd [14:01:10] PROBLEM - MediaWiki EtcdConfig up-to-date on mw1425 is CRITICAL: etcd last index (474291) is outdated compared to the master one (474294) https://wikitech.wikimedia.org/wiki/Etcd [14:01:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:11] PROBLEM - 
MediaWiki EtcdConfig up-to-date on mw1413 is CRITICAL: etcd last index (474291) is outdated compared to the master one (474294) https://wikitech.wikimedia.org/wiki/Etcd [14:01:11] PROBLEM - MediaWiki EtcdConfig up-to-date on mw1417 is CRITICAL: etcd last index (474291) is outdated compared to the master one (474294) https://wikitech.wikimedia.org/wiki/Etcd [14:01:12] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [14:01:12] PROBLEM - MediaWiki EtcdConfig up-to-date on mw2315 is CRITICAL: etcd last index (178453) is outdated compared to the master one (178459) https://wikitech.wikimedia.org/wiki/Etcd [14:01:12] PROBLEM - MediaWiki EtcdConfig up-to-date on mw2273 is CRITICAL: etcd last index (178453) is outdated compared to the master one (178459) https://wikitech.wikimedia.org/wiki/Etcd [14:01:13] PROBLEM - MediaWiki EtcdConfig up-to-date on mw2367 is CRITICAL: etcd last index (178453) is outdated compared to the master one (178459) https://wikitech.wikimedia.org/wiki/Etcd [14:01:13] PROBLEM - MediaWiki EtcdConfig up-to-date on mw2326 is CRITICAL: etcd last index (178453) is outdated compared to the master one (178459) https://wikitech.wikimedia.org/wiki/Etcd [14:01:16] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/tegola-vector-tiles: sync [14:01:17] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db1180.eqiad.wmnet with reason: Rebooting for T303174 [14:01:18] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db1180.eqiad.wmnet with reason: Rebooting for T303174 [14:01:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[14:01:23] !log kormat@cumin1001 dbctl commit (dc=all): 'db1180 depooling: Rebooting for T303174', diff saved to https://phabricator.wikimedia.org/P25703 and previous config saved to /var/cache/conftool/dbconfig/20220420-140123-kormat.json [14:01:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:29] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/tegola-vector-tiles: sync [14:01:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:22] RECOVERY - MediaWiki EtcdConfig up-to-date on mw1331 is OK: etcd last index (474297) matches the master one (474297) https://wikitech.wikimedia.org/wiki/Etcd [14:03:22] RECOVERY - MediaWiki EtcdConfig up-to-date on mw1322 is OK: etcd last index (474297) matches the master one (474297) https://wikitech.wikimedia.org/wiki/Etcd [14:03:22] (03PS1) 10Ladsgroup: mariadb: Promote db1104 to s8 master [puppet] - 10https://gerrit.wikimedia.org/r/784681 (https://phabricator.wikimedia.org/T306001) [14:03:24] RECOVERY - MediaWiki EtcdConfig up-to-date on mw1381 is OK: etcd last index (474297) matches the master one (474297) https://wikitech.wikimedia.org/wiki/Etcd [14:03:24] RECOVERY - MediaWiki EtcdConfig up-to-date on mw2295 is OK: etcd last index (178465) matches the master one (178465) https://wikitech.wikimedia.org/wiki/Etcd [14:03:24] RECOVERY - MediaWiki EtcdConfig up-to-date on mw2338 is OK: etcd last index (178465) matches the master one (178465) https://wikitech.wikimedia.org/wiki/Etcd [14:03:24] RECOVERY - MediaWiki EtcdConfig up-to-date on mw2283 is OK: etcd last index (178465) matches the master one (178465) https://wikitech.wikimedia.org/wiki/Etcd [14:03:24] RECOVERY - MediaWiki EtcdConfig up-to-date on mw2322 is OK: etcd last index (178465) matches the master one (178465) https://wikitech.wikimedia.org/wiki/Etcd [14:03:25] RECOVERY - MediaWiki EtcdConfig 
up-to-date on mw1425 is OK: etcd last index (474297) matches the master one (474297) https://wikitech.wikimedia.org/wiki/Etcd [14:03:25] RECOVERY - MediaWiki EtcdConfig up-to-date on mw1417 is OK: etcd last index (474297) matches the master one (474297) https://wikitech.wikimedia.org/wiki/Etcd [14:03:26] RECOVERY - MediaWiki EtcdConfig up-to-date on mw1413 is OK: etcd last index (474297) matches the master one (474297) https://wikitech.wikimedia.org/wiki/Etcd [14:03:28] RECOVERY - MediaWiki EtcdConfig up-to-date on mw2315 is OK: etcd last index (178465) matches the master one (178465) https://wikitech.wikimedia.org/wiki/Etcd [14:03:28] RECOVERY - MediaWiki EtcdConfig up-to-date on mw2273 is OK: etcd last index (178465) matches the master one (178465) https://wikitech.wikimedia.org/wiki/Etcd [14:03:28] RECOVERY - MediaWiki EtcdConfig up-to-date on mw2367 is OK: etcd last index (178465) matches the master one (178465) https://wikitech.wikimedia.org/wiki/Etcd [14:03:28] RECOVERY - MediaWiki EtcdConfig up-to-date on mw2326 is OK: etcd last index (178465) matches the master one (178465) https://wikitech.wikimedia.org/wiki/Etcd [14:05:46] !log kormat@cumin1001 dbctl commit (dc=all): 'db1180 (re)pooling @ 25%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25704 and previous config saved to /var/cache/conftool/dbconfig/20220420-140546-kormat.json [14:05:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:11] (03PS2) 10Ladsgroup: mariadb: Promote db1104 to s8 master [puppet] - 10https://gerrit.wikimedia.org/r/784681 (https://phabricator.wikimedia.org/T303927) [14:07:12] !log kormat@cumin1001 dbctl commit (dc=all): 'db1177 (re)pooling @ 25%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25705 and previous config saved to /var/cache/conftool/dbconfig/20220420-140711-kormat.json [14:07:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:07] !log kormat@cumin1001 dbctl 
commit (dc=all): 'db1148 (re)pooling @ 75%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25706 and previous config saved to /var/cache/conftool/dbconfig/20220420-140806-kormat.json [14:08:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:33] (03CR) 10Marostegui: [C: 03+1] mariadb: Promote db1104 to s8 master [puppet] - 10https://gerrit.wikimedia.org/r/784681 (https://phabricator.wikimedia.org/T303927) (owner: 10Ladsgroup) [14:10:11] (03CR) 10Marostegui: [C: 03+1] wmnet: Update s8-master CNAME [dns] - 10https://gerrit.wikimedia.org/r/784678 (https://phabricator.wikimedia.org/T303927) (owner: 10Ladsgroup) [14:11:27] !log kormat@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 50%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25707 and previous config saved to /var/cache/conftool/dbconfig/20220420-141127-kormat.json [14:11:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:52] !log kormat@cumin1001 dbctl commit (dc=all): 'es1030 (re)pooling @ 50%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25708 and previous config saved to /var/cache/conftool/dbconfig/20220420-141152-kormat.json [14:11:53] !log kormat@cumin1001 dbctl commit (dc=all): 'es1031 (re)pooling @ 50%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25709 and previous config saved to /var/cache/conftool/dbconfig/20220420-141152-kormat.json [14:11:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:09] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db2090.codfw.wmnet with reason: Rebooting for T303174 [14:12:11] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db2090.codfw.wmnet with reason: Rebooting for T303174 [14:12:12] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:44] !log kormat@cumin1001 dbctl commit (dc=all): 'es1021 (re)pooling @ 50%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25710 and previous config saved to /var/cache/conftool/dbconfig/20220420-141244-kormat.json [14:12:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:45] !log kormat@cumin1001 dbctl commit (dc=all): 'es1024 (re)pooling @ 50%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25711 and previous config saved to /var/cache/conftool/dbconfig/20220420-141345-kormat.json [14:13:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:34] (03CR) 10Andrew Bogott: [C: 03+2] wikitech: remove absented mw-xml cron [puppet] - 10https://gerrit.wikimedia.org/r/781054 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [14:18:10] 10SRE, 10Beta-Cluster-Infrastructure, 10Quality-and-Test-Engineering-Team (QTE), 10Traffic, and 2 others: [epic] The SSL certificate for Beta cluster domains fails to properly renew & deploy - https://phabricator.wikimedia.org/T293585 (10bd808) [14:18:53] (03PS1) 10Ayounsi: wmf-netbox: remove the need for fetch_device_circuits [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/784683 (https://phabricator.wikimedia.org/T259166) [14:20:50] !log kormat@cumin1001 dbctl commit (dc=all): 'db1180 (re)pooling @ 50%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25712 and previous config saved to /var/cache/conftool/dbconfig/20220420-142050-kormat.json [14:20:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:21] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Infrastructure-Foundations, 10netops: (Need By: TBD) rack/setup/install new mr1-ulsfo - https://phabricator.wikimedia.org/T294314 (10ayounsi) [14:21:43] 10SRE-tools, 
Infrastructure-Foundations, netbox, Patch-For-Review: Move device attributes - https://phabricator.wikimedia.org/T259166 (ayounsi) Resolved→Open This revealed a limitation in Netbox. After investigation the short version is that Netbox caches the device connected to a cable for... [14:22:15] (CR) CDanis: [C: +1] cache: Fix varnish-frontend puppetization on non-etcd environments [puppet] - https://gerrit.wikimedia.org/r/784666 (owner: Vgutierrez) [14:22:16] !log kormat@cumin1001 dbctl commit (dc=all): 'db1177 (re)pooling @ 50%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25713 and previous config saved to /var/cache/conftool/dbconfig/20220420-142215-kormat.json [14:22:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:11] !log kormat@cumin1001 dbctl commit (dc=all): 'db1148 (re)pooling @ 100%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25714 and previous config saved to /var/cache/conftool/dbconfig/20220420-142310-kormat.json [14:23:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:18] !log installing webperf1004 T305460 [14:23:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:24] T305460: Upgrade webperf hosts to Bullseye - https://phabricator.wikimedia.org/T305460 [14:25:19] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db1149.eqiad.wmnet with reason: Rebooting for T303174 [14:25:21] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db1149.eqiad.wmnet with reason: Rebooting for T303174 [14:25:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:27] !log kormat@cumin1001 dbctl commit (dc=all): 'db1149 depooling: Rebooting for T303174', diff saved to https://phabricator.wikimedia.org/P25715 and previous config saved to /var/cache/conftool/dbconfig/20220420-142526-kormat.json [14:25:29] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:01] (03CR) 10Raymond Ndibe: DONOTMERGE: skeleteon for the replicaconfig service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/780853 (owner: 10David Caro) [14:26:11] (03PS1) 10Ladsgroup: ActorMigration: Start reading from rev_actor field in group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/784685 (https://phabricator.wikimedia.org/T275246) [14:26:31] !log kormat@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 75%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25716 and previous config saved to /var/cache/conftool/dbconfig/20220420-142630-kormat.json [14:26:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:56] !log kormat@cumin1001 dbctl commit (dc=all): 'es1030 (re)pooling @ 75%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25717 and previous config saved to /var/cache/conftool/dbconfig/20220420-142656-kormat.json [14:26:57] !log kormat@cumin1001 dbctl commit (dc=all): 'es1031 (re)pooling @ 75%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25718 and previous config saved to /var/cache/conftool/dbconfig/20220420-142656-kormat.json [14:26:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:48] !log kormat@cumin1001 dbctl commit (dc=all): 'es1021 (re)pooling @ 75%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25719 and previous config saved to /var/cache/conftool/dbconfig/20220420-142748-kormat.json [14:27:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:52] (03PS8) 10CDanis: Proof of concept for haproxy statistics tracking [puppet] - 10https://gerrit.wikimedia.org/r/784309 [14:28:49] !log kormat@cumin1001 
dbctl commit (dc=all): 'es1024 (re)pooling @ 75%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25720 and previous config saved to /var/cache/conftool/dbconfig/20220420-142848-kormat.json [14:28:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:58] !log kormat@cumin1001 dbctl commit (dc=all): 'db1149 (re)pooling @ 25%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25721 and previous config saved to /var/cache/conftool/dbconfig/20220420-142957-kormat.json [14:30:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:53] !log taavi@deploy1002 Started deploy [horizon/deploy@9d02cd6] (dev): updating wmf-puppet-dashboard for keystone authentication support (codwf1dev) [14:30:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:52] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1157.eqiad.wmnet with reason: Maintenance [14:32:53] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1157.eqiad.wmnet with reason: Maintenance [14:32:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:55] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [14:32:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1157 (T298565)', diff saved to https://phabricator.wikimedia.org/P25722 and previous config saved to /var/cache/conftool/dbconfig/20220420-143258-ladsgroup.json [14:33:01] !log taavi@deploy1002 Finished deploy [horizon/deploy@9d02cd6] (dev): updating wmf-puppet-dashboard for keystone authentication support 
(codwf1dev) (duration: 02m 08s) [14:33:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:04] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [14:33:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:54] !log kormat@cumin1001 dbctl commit (dc=all): 'db1180 (re)pooling @ 75%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25723 and previous config saved to /var/cache/conftool/dbconfig/20220420-143554-kormat.json [14:35:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:19] !log kormat@cumin1001 dbctl commit (dc=all): 'db1177 (re)pooling @ 75%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25724 and previous config saved to /var/cache/conftool/dbconfig/20220420-143719-kormat.json [14:37:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:52] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/tegola-vector-tiles: sync [14:37:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:09] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/tegola-vector-tiles: sync [14:38:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:35] !log kormat@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 100%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25725 and previous config saved to /var/cache/conftool/dbconfig/20220420-144134-kormat.json [14:41:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:00] !log kormat@cumin1001 dbctl commit (dc=all): 'es1030 (re)pooling @ 100%: Reboot T303174', diff saved to 
https://phabricator.wikimedia.org/P25726 and previous config saved to /var/cache/conftool/dbconfig/20220420-144159-kormat.json [14:42:01] !log kormat@cumin1001 dbctl commit (dc=all): 'es1031 (re)pooling @ 100%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25727 and previous config saved to /var/cache/conftool/dbconfig/20220420-144200-kormat.json [14:42:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:52] !log kormat@cumin1001 dbctl commit (dc=all): 'es1021 (re)pooling @ 100%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25728 and previous config saved to /var/cache/conftool/dbconfig/20220420-144252-kormat.json [14:42:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:13] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for AAssaf - https://phabricator.wikimedia.org/T306437 (10Jdforrester-WMF) >>! In T306437#7864599, @Volans wrote: > @dr0ptp4kt could you please clarify if this access request (and the other related to the same project) is instead for the NDA group more th... 
[14:43:34] (03PS1) 10Jdrewniak: Add wgWMEWebUIScrollTrackingSamplingRate config to beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/784690 (https://phabricator.wikimedia.org/T303297) [14:43:53] !log kormat@cumin1001 dbctl commit (dc=all): 'es1024 (re)pooling @ 100%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25729 and previous config saved to /var/cache/conftool/dbconfig/20220420-144352-kormat.json [14:43:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:56] !log taavi@deploy1002 Started deploy [horizon/deploy@9d02cd6] (dev): updating wmf-puppet-dashboard for keystone authentication support (codwf1dev) [14:43:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:36] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db1174.eqiad.wmnet with reason: Rebooting for T303174 [14:44:38] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db1174.eqiad.wmnet with reason: Rebooting for T303174 [14:44:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:43] !log kormat@cumin1001 dbctl commit (dc=all): 'db1174 depooling: Rebooting for T303174', diff saved to https://phabricator.wikimedia.org/P25730 and previous config saved to /var/cache/conftool/dbconfig/20220420-144443-kormat.json [14:44:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:02] !log kormat@cumin1001 dbctl commit (dc=all): 'db1149 (re)pooling @ 50%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25731 and previous config saved to /var/cache/conftool/dbconfig/20220420-144501-kormat.json [14:45:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:05] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on es1033.eqiad.wmnet with 
reason: Rebooting for T303174 [14:45:07] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on es1033.eqiad.wmnet with reason: Rebooting for T303174 [14:45:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:12] !log kormat@cumin1001 dbctl commit (dc=all): 'es1033 depooling: Rebooting for T303174', diff saved to https://phabricator.wikimedia.org/P25732 and previous config saved to /var/cache/conftool/dbconfig/20220420-144511-kormat.json [14:45:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:51] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on es1034.eqiad.wmnet with reason: Rebooting for T303174 [14:45:52] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on es1034.eqiad.wmnet with reason: Rebooting for T303174 [14:45:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:55] !log taavi@deploy1002 Finished deploy [horizon/deploy@9d02cd6] (dev): updating wmf-puppet-dashboard for keystone authentication support (codwf1dev) (duration: 01m 59s) [14:45:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:58] !log kormat@cumin1001 dbctl commit (dc=all): 'es1034 depooling: Rebooting for T303174', diff saved to https://phabricator.wikimedia.org/P25733 and previous config saved to /var/cache/conftool/dbconfig/20220420-144557-kormat.json [14:46:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:08] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on es1022.eqiad.wmnet with reason: Rebooting for T303174 [14:46:10] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on 
es1022.eqiad.wmnet with reason: Rebooting for T303174 [14:46:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:16] !log kormat@cumin1001 dbctl commit (dc=all): 'es1022 depooling: Rebooting for T303174', diff saved to https://phabricator.wikimedia.org/P25734 and previous config saved to /var/cache/conftool/dbconfig/20220420-144615-kormat.json [14:46:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:51] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for AAssaf - https://phabricator.wikimedia.org/T306437 (10Volans) @Jdforrester-WMF I can check what's the difference in Gerrit, it depends on the repositories I guess. Do they have an `@wikimedia.org` email account? As per https://wikitech.wikimedia.org/w... [14:47:23] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on es1025.eqiad.wmnet with reason: Rebooting for T303174 [14:47:25] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on es1025.eqiad.wmnet with reason: Rebooting for T303174 [14:47:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:30] !log kormat@cumin1001 dbctl commit (dc=all): 'es1025 depooling: Rebooting for T303174', diff saved to https://phabricator.wikimedia.org/P25735 and previous config saved to /var/cache/conftool/dbconfig/20220420-144730-kormat.json [14:47:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:39] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for AAssaf - https://phabricator.wikimedia.org/T306437 (10Jdforrester-WMF) >>! In T306437#7868290, @Volans wrote: > @Jdforrester-WMF I can check what's the difference in Gerrit, it depends on the repositories I guess. The relevant group is https://gerrit... 
[14:50:58] !log kormat@cumin1001 dbctl commit (dc=all): 'db1180 (re)pooling @ 100%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25737 and previous config saved to /var/cache/conftool/dbconfig/20220420-145057-kormat.json [14:51:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:02] !log hnowlan@puppetmaster1001 conftool action : set/pooled=false; selector: dnsdisc=kartotherian,name=eqiad [14:51:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:24] !log taavi@deploy1002 Started deploy [horizon/deploy@9d02cd6] (dev): updating wmf-puppet-dashboard for keystone authentication support (codfw1dev) [14:51:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:52] (03Abandoned) 10Arlolra: Commit changes from update --no-dev before bumping parsoid [vendor] (wmf/1.39.0-wmf.8) - 10https://gerrit.wikimedia.org/r/784340 (owner: 10Arlolra) [14:51:59] (03Abandoned) 10Arlolra: Bump parsoid to 0.16.0-a6 [vendor] (wmf/1.39.0-wmf.8) - 10https://gerrit.wikimedia.org/r/784341 (https://phabricator.wikimedia.org/T305641) (owner: 10Arlolra) [14:52:23] !log kormat@cumin1001 dbctl commit (dc=all): 'db1177 (re)pooling @ 100%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25738 and previous config saved to /var/cache/conftool/dbconfig/20220420-145223-kormat.json [14:52:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:28] !log taavi@deploy1002 Finished deploy [horizon/deploy@9d02cd6] (dev): updating wmf-puppet-dashboard for keystone authentication support (codfw1dev) (duration: 02m 03s) [14:53:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:37] (03CR) 10David Caro: DONOTMERGE: skeleteon for the replicaconfig service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/780853 (owner: 10David Caro) [14:54:47] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 
1:30:00 on db1178.eqiad.wmnet with reason: Rebooting for T303174 [14:54:49] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db1178.eqiad.wmnet with reason: Rebooting for T303174 [14:54:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:54] !log kormat@cumin1001 dbctl commit (dc=all): 'db1178 depooling: Rebooting for T303174', diff saved to https://phabricator.wikimedia.org/P25739 and previous config saved to /var/cache/conftool/dbconfig/20220420-145454-kormat.json [14:54:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:13] !log taavi@deploy1002 Started deploy [horizon/deploy@9d02cd6]: updating wmf-puppet-dashboard for keystone authentication support T274666 (eqiad1) [14:55:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:18] T274666: Add keystone auth middleware to the puppet enc api - https://phabricator.wikimedia.org/T274666 [14:55:40] (03PS1) 10Ayounsi: Fix for cable termination not being updated [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/784694 (https://phabricator.wikimedia.org/T259166) [14:55:47] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for AAssaf - https://phabricator.wikimedia.org/T306437 (10Volans) >>! In T306437#7868299, @Jdforrester-WMF wrote: >> Do they have an `@wikimedia.org` email account? As per https://wikitech.wikimedia.org/wiki/SRE/Clinic_Duty/Access_requests#WMF_group we us... 
[14:56:58] PROBLEM - Disk space on ml-staging-ctrl2001 is CRITICAL: DISK CRITICAL - free space: / 1124 MB (5% inode=95%): /tmp 1124 MB (5% inode=95%): /var/tmp 1124 MB (5% inode=95%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ml-staging-ctrl2001&var-datasource=codfw+prometheus/ops
[14:57:07] (CR) Ayounsi: "Tested in netbox-next" [software/netbox-extras] - https://gerrit.wikimedia.org/r/784694 (https://phabricator.wikimedia.org/T259166) (owner: Ayounsi)
[14:57:11] SRE, LDAP-Access-Requests: Grant Access to ldap/wmf for AAssaf - https://phabricator.wikimedia.org/T306437 (Volans) LDAP `wmf` group granted for `aassaf`.
[14:58:36] SRE, LDAP-Access-Requests: Grant Access to ldap/wmf for Samtar (TheresNoTime) - https://phabricator.wikimedia.org/T306518 (Volans) >>! In T306518#7868065, @TheresNoTime wrote: > Hey @Volans, I'm already in the `ldap/nda` group from previous volunteer work :-) I believe the only reason for this request wa...
[14:59:16] !log kormat@cumin1001 dbctl commit (dc=all): 'db1178 (re)pooling @ 25%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25740 and previous config saved to /var/cache/conftool/dbconfig/20220420-145915-kormat.json
[14:59:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:00:06] !log kormat@cumin1001 dbctl commit (dc=all): 'db1149 (re)pooling @ 75%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25741 and previous config saved to /var/cache/conftool/dbconfig/20220420-150005-kormat.json
[15:00:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:00:16] !log taavi@deploy1002 Finished deploy [horizon/deploy@9d02cd6]: updating wmf-puppet-dashboard for keystone authentication support T274666 (eqiad1) (duration: 05m 03s)
[15:00:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:00:21] T274666: Add keystone auth middleware to the puppet enc api - https://phabricator.wikimedia.org/T274666
[15:00:32] SRE, LDAP-Access-Requests, SRE Observability (FY2021/2022-Q4): Grant Access to ldap/wmf for "Mary Yang" - https://phabricator.wikimedia.org/T306225 (Volans) As clarified in the related task above, granted `ldap/wmf` to `uid=maryyang`.
[15:01:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T298565)', diff saved to https://phabricator.wikimedia.org/P25742 and previous config saved to /var/cache/conftool/dbconfig/20220420-150119-ladsgroup.json [15:01:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:25] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [15:01:49] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db1174.eqiad.wmnet with reason: Rebooting for T303174 [15:01:50] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db1174.eqiad.wmnet with reason: Rebooting for T303174 [15:01:50] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on es1033.eqiad.wmnet with reason: Rebooting for T303174 [15:01:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:52] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on es1033.eqiad.wmnet with reason: Rebooting for T303174 [15:01:52] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on es1034.eqiad.wmnet with reason: Rebooting for T303174 [15:01:54] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on es1034.eqiad.wmnet with reason: Rebooting for T303174 [15:01:54] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on es1022.eqiad.wmnet with reason: Rebooting for T303174 [15:01:55] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - 
https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [15:01:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:56] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on es1022.eqiad.wmnet with reason: Rebooting for T303174 [15:01:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:04] 10SRE-swift-storage, 10Data-Persistence, 10Maps, 10Product-Infrastructure-Team-Backlog, 10Patch-For-Review: Tegola pods are crashing because swift doesnt allow connections - https://phabricator.wikimedia.org/T306424 (10fgiunchedi) >>! In T306424#7867892, @Jgiannelos wrote: > Sounds good I've begun a cop... 
[15:04:27] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host serpens.wikimedia.org
[15:04:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:04:47] (PS1) Ssingh: P:wikidough: add a check to ensure service has been restarted [puppet] - https://gerrit.wikimedia.org/r/784697
[15:04:58] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on es1025.eqiad.wmnet with reason: Rebooting for T303174
[15:04:59] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on es1025.eqiad.wmnet with reason: Rebooting for T303174
[15:05:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:05:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:05:16] SRE, LDAP-Access-Requests: Grant Access to ldap/wmf for Samtar (TheresNoTime) - https://phabricator.wikimedia.org/T306518 (TheresNoTime) >>! In T306518#7868350, @Volans wrote: >>>! In T306518#7868065, @TheresNoTime wrote: >> Hey @Volans, I'm already in the `ldap/nda` group from previous volunteer work :-...
[15:05:22] 10SRE-swift-storage, 10Data-Persistence, 10Maps, 10Product-Infrastructure-Team-Backlog, 10Patch-For-Review: Tegola pods are crashing because swift doesnt allow connections - https://phabricator.wikimedia.org/T306424 (10Jgiannelos) Thanks @fgiunchedi [15:05:40] !log kormat@cumin1001 dbctl commit (dc=all): 'es1034 (re)pooling @ 25%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25743 and previous config saved to /var/cache/conftool/dbconfig/20220420-150539-kormat.json [15:05:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:41] (03PS2) 10Ssingh: P:wikidough: add a check to ensure service has been restarted [puppet] - 10https://gerrit.wikimedia.org/r/784697 [15:08:01] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host serpens.wikimedia.org [15:08:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:07] !log kormat@cumin1001 dbctl commit (dc=all): 'db1174 (re)pooling @ 25%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25744 and previous config saved to /var/cache/conftool/dbconfig/20220420-150806-kormat.json [15:08:08] (03PS1) 10Volans: admin: add LDAP only accounts: aassaf, maryyang [puppet] - 10https://gerrit.wikimedia.org/r/784698 (https://phabricator.wikimedia.org/T306437) [15:08:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:25] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34916/console" [puppet] - 10https://gerrit.wikimedia.org/r/784697 (owner: 10Ssingh) [15:14:20] !log kormat@cumin1001 dbctl commit (dc=all): 'db1178 (re)pooling @ 50%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25745 and previous config saved to /var/cache/conftool/dbconfig/20220420-151419-kormat.json [15:14:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:10] !log 
kormat@cumin1001 dbctl commit (dc=all): 'db1149 (re)pooling @ 100%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25746 and previous config saved to /var/cache/conftool/dbconfig/20220420-151509-kormat.json [15:15:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P25747 and previous config saved to /var/cache/conftool/dbconfig/20220420-151625-ladsgroup.json [15:16:25] !log hnowlan@puppetmaster1001 conftool action : set/pooled=true; selector: dnsdisc=kartotherian,name=eqiad [15:16:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:15] (03CR) 10Volans: "LGTM, one question inline to be safer" [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/784683 (https://phabricator.wikimedia.org/T259166) (owner: 10Ayounsi) [15:17:41] (03CR) 10Volans: [C: 03+1] "LGTM" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/784694 (https://phabricator.wikimedia.org/T259166) (owner: 10Ayounsi) [15:18:08] PROBLEM - SSH on wtp1025.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:18:46] !log installing wireshark security updates [15:18:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:18] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for Samtar (TheresNoTime) - https://phabricator.wikimedia.org/T306518 (10Volans) >>! In T306518#7868421, @TheresNoTime wrote: > And yes I did (`starling-ctr@wikimedia`) :-) Perfect, then you just need to provide me an LDAP (wikitech) account associated w... 
[15:19:58] (03CR) 10Vgutierrez: [C: 03+1] "it looks good as a PoC, I'm not 100% sold on how we are storing IPv4 addresses as it doesn't match what we feed to varnish via X-Client-IP" [puppet] - 10https://gerrit.wikimedia.org/r/784309 (owner: 10CDanis) [15:20:29] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on es1033.eqiad.wmnet with reason: Rebooting for T303174 [15:20:30] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on es1033.eqiad.wmnet with reason: Rebooting for T303174 [15:20:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:35] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on es1022.eqiad.wmnet with reason: Rebooting for T303174 [15:20:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:36] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on es1022.eqiad.wmnet with reason: Rebooting for T303174 [15:20:37] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on es1025.eqiad.wmnet with reason: Rebooting for T303174 [15:20:39] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on es1025.eqiad.wmnet with reason: Rebooting for T303174 [15:20:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:43] !log kormat@cumin1001 dbctl commit (dc=all): 'es1034 (re)pooling @ 50%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25748 and previous config saved to /var/cache/conftool/dbconfig/20220420-152043-kormat.json [15:20:44] !log kormat@cumin1001 dbctl commit (dc=all): 'es1025 depooling: Rebooting for T303174', diff saved to https://phabricator.wikimedia.org/P25749 and previous config saved to /var/cache/conftool/dbconfig/20220420-152044-kormat.json [15:20:46] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:21] (03CR) 10Elukey: [C: 03+2] ml-services: update editquality predictor image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/784668 (https://phabricator.wikimedia.org/T306501) (owner: 10Kevin Bazira) [15:22:45] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for Samtar (TheresNoTime) - https://phabricator.wikimedia.org/T306518 (10TheresNoTime) Oh! I've just changed the email for my [[ https://ldap.toolforge.org/user/samtar | current LDAP account ]] (`samtar`) — hopefully that works :) [15:22:53] 10SRE-swift-storage, 10Data-Persistence, 10Maps, 10Product-Infrastructure-Team-Backlog, 10Patch-For-Review: Tegola pods are crashing because swift doesnt allow connections - https://phabricator.wikimedia.org/T306424 (10Jgiannelos) @fgiunchedi is there a mitigation for the underlying object DB issue in th... 
[15:23:11] !log kormat@cumin1001 dbctl commit (dc=all): 'db1174 (re)pooling @ 50%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25750 and previous config saved to /var/cache/conftool/dbconfig/20220420-152310-kormat.json [15:23:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:31] !log jynus@cumin1001 START - Cookbook sre.hosts.reimage for host backup1007.eqiad.wmnet with OS bullseye [15:23:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:15] !log kormat@cumin1001 dbctl commit (dc=all): 'es1033 (re)pooling @ 25%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25751 and previous config saved to /var/cache/conftool/dbconfig/20220420-152414-kormat.json [15:24:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:12] !log kormat@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 25%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25752 and previous config saved to /var/cache/conftool/dbconfig/20220420-152611-kormat.json [15:26:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:23] !log kormat@cumin1001 dbctl commit (dc=all): 'db1178 (re)pooling @ 75%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25753 and previous config saved to /var/cache/conftool/dbconfig/20220420-152923-kormat.json [15:29:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:25] (03CR) 10CDanis: Proof of concept for haproxy statistics tracking (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/784309 (owner: 10CDanis) [15:31:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P25754 and previous config saved to /var/cache/conftool/dbconfig/20220420-153130-ladsgroup.json [15:31:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[15:32:35] (PS1) Elukey: Add four new k8s worker nodes to ml-serve-eqiad [puppet] - https://gerrit.wikimedia.org/r/784701 (https://phabricator.wikimedia.org/T306545)
[15:33:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T298565)', diff saved to https://phabricator.wikimedia.org/P25755 and previous config saved to /var/cache/conftool/dbconfig/20220420-153312-ladsgroup.json
[15:33:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:33:19] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565
[15:33:20] (PS2) Elukey: Add four new k8s worker nodes to ml-serve-eqiad [puppet] - https://gerrit.wikimedia.org/r/784701 (https://phabricator.wikimedia.org/T306545)
[15:35:17] SRE, LDAP-Access-Requests: Grant Access to ldap/wmf for Samtar (TheresNoTime) - https://phabricator.wikimedia.org/T306518 (Volans) Granted `ldap/wmf` to `uid= samtar`, revoked pre-existing `ldap/nda` one as they can't coexists on the same account. Don't worry if/when the contract will be over you can re-...
[15:35:47] !log kormat@cumin1001 dbctl commit (dc=all): 'es1034 (re)pooling @ 75%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25756 and previous config saved to /var/cache/conftool/dbconfig/20220420-153547-kormat.json [15:35:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:32] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/784698 (https://phabricator.wikimedia.org/T306437) (owner: 10Volans) [15:37:02] (03PS2) 10Ayounsi: wmf-netbox: remove the need for fetch_device_circuits [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/784683 (https://phabricator.wikimedia.org/T259166) [15:37:49] (03CR) 10Ayounsi: wmf-netbox: remove the need for fetch_device_circuits (031 comment) [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/784683 (https://phabricator.wikimedia.org/T259166) (owner: 10Ayounsi) [15:38:15] !log kormat@cumin1001 dbctl commit (dc=all): 'db1174 (re)pooling @ 75%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25757 and previous config saved to /var/cache/conftool/dbconfig/20220420-153814-kormat.json [15:38:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:48] PROBLEM - Disk space on ml-staging-ctrl2001 is CRITICAL: DISK CRITICAL - free space: / 1117 MB (5% inode=95%): /tmp 1117 MB (5% inode=95%): /var/tmp 1117 MB (5% inode=95%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ml-staging-ctrl2001&var-datasource=codfw+prometheus/ops [15:39:19] !log kormat@cumin1001 dbctl commit (dc=all): 'es1033 (re)pooling @ 50%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25758 and previous config saved to /var/cache/conftool/dbconfig/20220420-153918-kormat.json [15:39:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:36] (03CR) 10WMDE-Fisch: "This change is ready for review." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/784702 (https://phabricator.wikimedia.org/T304076) (owner: 10WMDE-Fisch) [15:40:36] 10SRE, 10ops-esams, 10DC-Ops: ripe-atlas-esams down - https://phabricator.wikimedia.org/T303242 (10RobH) Emailed into Wim, the guy we ordered it from: > > Hello, > > We purchased Atlas SN 000024D23DF4 in June 2017. I don't recall if the support on these is 2,3, or 5 years. > > Recently it just died on... [15:41:16] !log kormat@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 50%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25759 and previous config saved to /var/cache/conftool/dbconfig/20220420-154115-kormat.json [15:41:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:35] (03CR) 10Volans: [C: 03+2] admin: add LDAP only accounts: aassaf, maryyang [puppet] - 10https://gerrit.wikimedia.org/r/784698 (https://phabricator.wikimedia.org/T306437) (owner: 10Volans) [15:41:44] 10SRE-tools, 10Infrastructure-Foundations, 10netbox, 10Patch-For-Review: Move device attributes - https://phabricator.wikimedia.org/T259166 (10ayounsi) I fixed the prod ones with in nbshell: `lang=python source_device = Device.objects.get(name="mr1-ulsfo-old") destination_device = Device.objects.get(name=... 
[15:41:54] !log hnowlan@deploy1002 Started deploy [restbase/deploy@0205f1d]: Bump mediawiki-title to 0.7.5
[15:41:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:42:21] (CR) Ayounsi: [C: +2] Fix for cable termination not being updated [software/netbox-extras] - https://gerrit.wikimedia.org/r/784694 (https://phabricator.wikimedia.org/T259166) (owner: Ayounsi)
[15:42:51] (PS1) Elukey: Add ml-serve100[5-8] to the ml-serve-eqiad k8s BGP neighbors [homer/public] - https://gerrit.wikimedia.org/r/784703 (https://phabricator.wikimedia.org/T306545)
[15:42:59] (Merged) jenkins-bot: Fix for cable termination not being updated [software/netbox-extras] - https://gerrit.wikimedia.org/r/784694 (https://phabricator.wikimedia.org/T259166) (owner: Ayounsi)
[15:43:01] (PS1) Volans: admin: update samtar account [puppet] - https://gerrit.wikimedia.org/r/784704 (https://phabricator.wikimedia.org/T306518)
[15:43:22] (PS2) Elukey: Add ml-serve100[5-8] to the ml-serve-eqiad k8s BGP neighbors [homer/public] - https://gerrit.wikimedia.org/r/784703 (https://phabricator.wikimedia.org/T306545)
[15:43:34] (CR) Volans: "Bare in mind that there is also an absented 'samtar' shell account. I'm not sure if the two might conflict." [puppet] - https://gerrit.wikimedia.org/r/784704 (https://phabricator.wikimedia.org/T306518) (owner: Volans)
[15:44:27] !log kormat@cumin1001 dbctl commit (dc=all): 'db1178 (re)pooling @ 100%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25760 and previous config saved to /var/cache/conftool/dbconfig/20220420-154427-kormat.json
[15:44:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:45:59] SRE, LDAP-Access-Requests, Patch-For-Review, SRE Observability (FY2021/2022-Q4): Grant Access to ldap/wmf for "Mary Yang" - https://phabricator.wikimedia.org/T306225 (Volans) Open→Resolved a:Volans Patch merged, resolving.
[15:46:07] SRE, LDAP-Access-Requests, Patch-For-Review: Grant Access to ldap/wmf for AAssaf - https://phabricator.wikimedia.org/T306437 (Volans) Open→Resolved a:Volans Patch merged, resolving.
[15:46:30] (CR) Volans: [C: +1] "LGTM" [software/homer/deploy] - https://gerrit.wikimedia.org/r/784683 (https://phabricator.wikimedia.org/T259166) (owner: Ayounsi)
[15:46:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T298565)', diff saved to https://phabricator.wikimedia.org/P25761 and previous config saved to /var/cache/conftool/dbconfig/20220420-154635-ladsgroup.json
[15:46:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:46:41] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565
[15:46:50] SRE-swift-storage, Data-Persistence, Maps, Product-Infrastructure-Team-Backlog, Patch-For-Review: Tegola pods are crashing because swift doesnt allow connections - https://phabricator.wikimedia.org/T306424 (fgiunchedi) >>! In T306424#7868538, @Jgiannelos wrote: > @fgiunchedi is there a mitiga...
[15:47:24] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1161.eqiad.wmnet with reason: Maintenance [15:47:25] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1161.eqiad.wmnet with reason: Maintenance [15:47:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:27] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [15:47:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:30] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [15:47:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1161 (T298565)', diff saved to https://phabricator.wikimedia.org/P25762 and previous config saved to /var/cache/conftool/dbconfig/20220420-154734-ladsgroup.json [15:47:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:53] (03PS1) 10Btullis: Disable analytics for datahub [deployment-charts] - 10https://gerrit.wikimedia.org/r/784726 (https://phabricator.wikimedia.org/T299910) [15:48:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P25763 and previous config saved to /var/cache/conftool/dbconfig/20220420-154817-ladsgroup.json [15:48:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:59] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to ldap/wmf for Samtar (TheresNoTime) - https://phabricator.wikimedia.org/T306518 (10Volans) p:05Triage→03Medium [15:49:09] 10SRE: 
Allow Wikimedia Maps usage on a private project for an university. - https://phabricator.wikimedia.org/T306467 (10Bugreporter) If you only want to use the map service in a private project, it is better to set up a proxy to maps.wikimedia.org instead. [15:50:11] (03CR) 10Ayounsi: [V: 03+2 C: 03+2] wmf-netbox: remove the need for fetch_device_circuits [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/784683 (https://phabricator.wikimedia.org/T259166) (owner: 10Ayounsi) [15:50:51] !log kormat@cumin1001 dbctl commit (dc=all): 'es1034 (re)pooling @ 100%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25764 and previous config saved to /var/cache/conftool/dbconfig/20220420-155051-kormat.json [15:50:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:18] !log kormat@cumin1001 dbctl commit (dc=all): 'db1174 (re)pooling @ 100%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25765 and previous config saved to /var/cache/conftool/dbconfig/20220420-155318-kormat.json [15:53:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:22] !log kormat@cumin1001 dbctl commit (dc=all): 'es1033 (re)pooling @ 75%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25766 and previous config saved to /var/cache/conftool/dbconfig/20220420-155422-kormat.json [15:54:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:02] !log jynus@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host backup1007.eqiad.wmnet with OS bullseye [15:55:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:20] !log kormat@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 75%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25767 and previous config saved to /var/cache/conftool/dbconfig/20220420-155619-kormat.json [15:56:23] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:41] !log jynus@cumin1001 START - Cookbook sre.hosts.reimage for host backup1007.eqiad.wmnet with OS bullseye [15:56:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:29] !log hnowlan@deploy1002 Finished deploy [restbase/deploy@0205f1d]: Bump mediawiki-title to 0.7.5 (duration: 15m 35s) [15:57:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:28] (03CR) 10Btullis: [C: 03+2] Disable analytics for datahub [deployment-charts] - 10https://gerrit.wikimedia.org/r/784726 (https://phabricator.wikimedia.org/T299910) (owner: 10Btullis) [16:03:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P25768 and previous config saved to /var/cache/conftool/dbconfig/20220420-160322-ladsgroup.json [16:03:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:38] (03Merged) 10jenkins-bot: Disable analytics for datahub [deployment-charts] - 10https://gerrit.wikimedia.org/r/784726 (https://phabricator.wikimedia.org/T299910) (owner: 10Btullis) [16:03:43] 10SRE-tools, 10Infrastructure-Foundations, 10netbox, 10Patch-For-Review: Move device attributes - https://phabricator.wikimedia.org/T259166 (10ayounsi) 05Open→03Resolved [16:03:46] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Infrastructure-Foundations, 10netops: (Need By: TBD) rack/setup/install new mr1-ulsfo - https://phabricator.wikimedia.org/T294314 (10ayounsi) [16:03:52] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/784669 (https://phabricator.wikimedia.org/T306057) (owner: 10Ottomata) [16:05:00] 10SRE, 10SRE-Access-Requests, 10Data-Engineering, 10Generated Data Platform, 10Patch-For-Review: Request to grant cparle and mfossati login to an-airflow1003.eqiad.wmne - https://phabricator.wikimedia.org/T306057 (10Volans) I've +1ed the patch, @Ottomata feel free to merge 
whenever works for you. [16:05:54] (03PS4) 10Raymond Ndibe: Create REST api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) [16:06:28] (03CR) 10jerkins-bot: [V: 04-1] Create REST api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [16:09:26] !log kormat@cumin1001 dbctl commit (dc=all): 'es1033 (re)pooling @ 100%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25769 and previous config saved to /var/cache/conftool/dbconfig/20220420-160926-kormat.json [16:09:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:36] !log jynus@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on backup1007.eqiad.wmnet with reason: host reimage [16:09:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:24] !log kormat@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 100%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25770 and previous config saved to /var/cache/conftool/dbconfig/20220420-161123-kormat.json [16:11:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:36] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on es1025.eqiad.wmnet with reason: Rebooting for T303174 [16:12:37] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on es1025.eqiad.wmnet with reason: Rebooting for T303174 [16:12:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:00] !log jynus@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on backup1007.eqiad.wmnet with reason: host reimage [16:13:02] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:27] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Rebooting db1158 T303174 [16:13:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:30] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Rebooting db1158 T303174 [16:13:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:47] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db1158.eqiad.wmnet with reason: Rebooting for T303174 [16:13:48] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db1158.eqiad.wmnet with reason: Rebooting for T303174 [16:13:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:54] !log kormat@cumin1001 dbctl commit (dc=all): 'db1158 depooling: Rebooting for T303174', diff saved to https://phabricator.wikimedia.org/P25771 and previous config saved to /var/cache/conftool/dbconfig/20220420-161353-kormat.json [16:13:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:53] !log kormat@cumin1001 dbctl commit (dc=all): 'Change es3 'master' to es1031 T303174', diff saved to https://phabricator.wikimedia.org/P25772 and previous config saved to /var/cache/conftool/dbconfig/20220420-161453-kormat.json [16:14:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:04] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on es1028.eqiad.wmnet with reason: Rebooting for T303174 [16:15:06] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on es1028.eqiad.wmnet with reason: Rebooting 
for T303174 [16:15:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:11] !log kormat@cumin1001 dbctl commit (dc=all): 'es1028 depooling: Rebooting for T303174', diff saved to https://phabricator.wikimedia.org/P25773 and previous config saved to /var/cache/conftool/dbconfig/20220420-161511-kormat.json [16:15:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:15] RECOVERY - SSH on wtp1025.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:18:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T298565)', diff saved to https://phabricator.wikimedia.org/P25774 and previous config saved to /var/cache/conftool/dbconfig/20220420-161828-ladsgroup.json [16:18:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:39] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [16:19:15] !log kormat@cumin1001 dbctl commit (dc=all): 'db1158 (re)pooling @ 25%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25775 and previous config saved to /var/cache/conftool/dbconfig/20220420-161914-kormat.json [16:19:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:26] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on es1025.eqiad.wmnet with reason: Rebooting for T303174 [16:19:28] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on es1025.eqiad.wmnet with reason: Rebooting for T303174 [16:19:57] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:47] SRE-tools, Infrastructure-Foundations, Spicerack, netops: Spicerack: add network devices support - https://phabricator.wikimedia.org/T306552 (ayounsi) p:Triage→Medium [16:22:19] (PS1) Jforrester: [Beta Cluster] Correct Wikifunctions service host names [mediawiki-config] - https://gerrit.wikimedia.org/r/784729 (https://phabricator.wikimedia.org/T284162) [16:22:33] SRE, SRE-Access-Requests, Data-Engineering, Generated Data Platform, Patch-For-Review: Request to grant cparle and mfossati login to an-airflow1003.eqiad.wmne - https://phabricator.wikimedia.org/T306057 (Ottomata) Thanks got caught in meetings, will do soon! [16:22:54] (PS2) Ottomata: Add cparel and mfossati to analytics-platform-eng-admins [puppet] - https://gerrit.wikimedia.org/r/784669 (https://phabricator.wikimedia.org/T306057) [16:24:47] jouncebot: next [16:24:47] In 1 hour(s) and 35 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220420T1800) [16:24:47] In 1 hour(s) and 35 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220420T1800) [16:24:56] OK, I'll slip this out.
[16:25:16] (CR) Jforrester: [C: +2] [Beta Cluster] Correct Wikifunctions service host names [mediawiki-config] - https://gerrit.wikimedia.org/r/784729 (https://phabricator.wikimedia.org/T284162) (owner: Jforrester) [16:25:56] (Merged) jenkins-bot: [Beta Cluster] Correct Wikifunctions service host names [mediawiki-config] - https://gerrit.wikimedia.org/r/784729 (https://phabricator.wikimedia.org/T284162) (owner: Jforrester) [16:28:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [16:28:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:47] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [16:28:48] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [16:28:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:52] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [16:28:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:10] SRE, ops-eqiad: elastic1097 Failed DIMM slot A2 - https://phabricator.wikimedia.org/T306462 (Cmjohnson) Open→Invalid duplicate [16:33:55] SRE, ops-ulsfo, DC-Ops, Infrastructure-Foundations, netops: (Need By: TBD) rack/setup/install new mr1-ulsfo - https://phabricator.wikimedia.org/T294314 (RobH) Open→Resolved fixed the serial/power by just removing from old and duplicating on new, as it used all the old cables.
[16:33:58] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [16:34:01] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [16:34:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:02] SRE, ops-ulsfo, DC-Ops, Infrastructure-Foundations, netops: (Need By: TBD) rack/setup/install new mr1-ulsfo - https://phabricator.wikimedia.org/T294314 (RobH) [16:34:02] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [16:34:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:06] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [16:34:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:09] (PS3) Ottomata: Add cparle and mfossati to analytics-platform-eng-admins [puppet] - https://gerrit.wikimedia.org/r/784669 (https://phabricator.wikimedia.org/T306057) [16:34:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:19] !log kormat@cumin1001 dbctl commit (dc=all): 'db1158 (re)pooling @ 50%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25776 and previous config saved to /var/cache/conftool/dbconfig/20220420-163418-kormat.json [16:34:21] (PS1) Gergő Tisza: [beta] Enable Growth campaigns on all beta wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/784730 (https://phabricator.wikimedia.org/T305015) [16:34:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:15] (PS5) Raymond Ndibe: Create REST api service to manage toolforge replica.my.cnf [puppet] - https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) [16:35:31] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1175.eqiad.wmnet with reason: Maintenance [16:35:32] !log ladsgroup@cumin1001 END (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1175.eqiad.wmnet with reason: Maintenance [16:35:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1175 (T298565)', diff saved to https://phabricator.wikimedia.org/P25777 and previous config saved to /var/cache/conftool/dbconfig/20220420-163537-ladsgroup.json [16:35:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:42] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [16:35:50] (03CR) 10jerkins-bot: [V: 04-1] Create REST api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [16:36:10] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): hw troubleshooting: memory error for elastic1097 - https://phabricator.wikimedia.org/T306449 (10Cmjohnson) UEFI0339: The Dual Inline Memory Module (DIMM) in the memory slot A2 is disabled because of initialization errors caused by uncorrectab... [16:37:12] (03PS2) 10Gergő Tisza: [beta] Enable Growth campaigns on all beta wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/784730 (https://phabricator.wikimedia.org/T303785) [16:38:42] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): hw troubleshooting: memory error for elastic1097 - https://phabricator.wikimedia.org/T306449 (10Cmjohnson) Dell request for new DIMM place, You have successfully submitted request SR1091181415. 
[16:40:07] SRE, ops-eqiad: Degraded RAID on ms-be1068 - https://phabricator.wikimedia.org/T306215 (Cmjohnson) Open→Resolved this task was created when I fixed the raid and re-imaged it. [16:41:23] (CR) Ottomata: [C: +2] Add cparle and mfossati to analytics-platform-eng-admins [puppet] - https://gerrit.wikimedia.org/r/784669 (https://phabricator.wikimedia.org/T306057) (owner: Ottomata) [16:42:12] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1143.mgmt.eqiad.wmnet with reboot policy FORCED [16:42:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:52] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host an-worker1144.mgmt.eqiad.wmnet with reboot policy FORCED [16:42:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:02] SRE, Scap, Python3-Porting, Release-Engineering-Team (Priority Backlog 📥): git-fat needs to be ported to Python 3 - https://phabricator.wikimedia.org/T279509 (thcipriani) In our team meeting we talked about the possibility of migrating git-fat (600 lines of python2 → python3) vs. making the neede...
[16:43:13] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1144.mgmt.eqiad.wmnet with reboot policy FORCED [16:43:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:16] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1143.mgmt.eqiad.wmnet with reboot policy FORCED [16:43:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:45:36] SRE, SRE-Access-Requests, Data-Engineering, Generated Data Platform, Patch-For-Review: Request to grant cparle and mfossati login to an-airflow1003.eqiad.wmne - https://phabricator.wikimedia.org/T306057 (Ottomata) Open→Resolved a:Ottomata [16:47:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T298565)', diff saved to https://phabricator.wikimedia.org/P25778 and previous config saved to /var/cache/conftool/dbconfig/20220420-164749-ladsgroup.json [16:47:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:55] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [16:48:03] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:49:23] !log kormat@cumin1001 dbctl commit (dc=all): 'db1158 (re)pooling @ 75%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25779 and previous config saved to /var/cache/conftool/dbconfig/20220420-164922-kormat.json [16:49:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:49] !log jynus@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=1) for 
host backup1007.eqiad.wmnet with OS bullseye [16:50:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:57] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on es1028.eqiad.wmnet with reason: Rebooting for T303174 [16:50:59] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on es1028.eqiad.wmnet with reason: Rebooting for T303174 [16:51:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:51:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:51:07] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on es1025.eqiad.wmnet with reason: Rebooting for T303174 [16:51:09] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on es1025.eqiad.wmnet with reason: Rebooting for T303174 [16:51:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:51:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:52] (03CR) 10Gergő Tisza: [C: 03+2] "Self-merge, beta only" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/784730 (https://phabricator.wikimedia.org/T303785) (owner: 10Gergő Tisza) [16:54:36] (03Merged) 10jenkins-bot: [beta] Enable Growth campaigns on all beta wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/784730 (https://phabricator.wikimedia.org/T303785) (owner: 10Gergő Tisza) [16:57:16] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (10Cmjohnson) vlans updated [16:59:27] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [16:59:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:59:30] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [16:59:31] !log 
mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [16:59:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:59:35] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [16:59:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:59:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:52] RECOVERY - Disk space on ml-staging-ctrl2001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ml-staging-ctrl2001&var-datasource=codfw+prometheus/ops [17:01:20] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [17:01:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:01:59] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [17:02:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:05] !log kevinbazira@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . [17:02:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:12] !log btullis@deploy1002 helmfile [codfw] START helmfile.d/services/datahub: apply on main [17:02:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:26] !log kevinbazira@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . 
[17:02:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:44] (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [17:02:55] !log btullis@deploy1002 helmfile [codfw] DONE helmfile.d/services/datahub: sync on main [17:02:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P25780 and previous config saved to /var/cache/conftool/dbconfig/20220420-170254-ladsgroup.json [17:02:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:07] !log btullis@deploy1002 helmfile [eqiad] START helmfile.d/services/datahub: apply on main [17:03:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:18] (03CR) 10BBlack: [C: 03+1] "I like this, and I think it's useful even in current form (note: I did not review for basic python syntax/correctness, etc, just the overa" [puppet] - 10https://gerrit.wikimedia.org/r/784697 (owner: 10Ssingh) [17:03:36] !log btullis@deploy1002 helmfile [eqiad] DONE helmfile.d/services/datahub: sync on main [17:03:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:04:27] !log kormat@cumin1001 dbctl commit (dc=all): 'db1158 (re)pooling @ 100%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25781 and previous config saved to /var/cache/conftool/dbconfig/20220420-170426-kormat.json [17:04:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:07:03] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (10Cmjohnson) [17:08:04] !log 
ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T298565)', diff saved to https://phabricator.wikimedia.org/P25782 and previous config saved to /var/cache/conftool/dbconfig/20220420-170804-ladsgroup.json [17:08:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:08:09] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [17:12:13] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host parse1001.mgmt.eqiad.wmnet with reboot policy FORCED [17:12:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:24] RECOVERY - Disk space on ml-staging-ctrl2002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ml-staging-ctrl2002&var-datasource=codfw+prometheus/ops [17:16:11] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host parse1005.mgmt.eqiad.wmnet with reboot policy FORCED [17:16:11] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host parse1003.mgmt.eqiad.wmnet with reboot policy FORCED [17:16:11] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host parse1004.mgmt.eqiad.wmnet with reboot policy FORCED [17:16:11] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host parse1002.mgmt.eqiad.wmnet with reboot policy FORCED [17:16:11] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host parse1006.mgmt.eqiad.wmnet with reboot policy FORCED [17:16:12] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host parse1009.mgmt.eqiad.wmnet with reboot policy FORCED [17:16:12] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host 
parse1010.mgmt.eqiad.wmnet with reboot policy FORCED [17:16:13] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host parse1011.mgmt.eqiad.wmnet with reboot policy FORCED [17:16:13] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host parse1008.mgmt.eqiad.wmnet with reboot policy FORCED [17:16:14] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host parse1007.mgmt.eqiad.wmnet with reboot policy FORCED [17:16:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:16:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:16:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:16:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:16:25] (03PS1) 10Andrew Bogott: Make cloudcephmon200[5,6] into cloudcephmon nodes. [puppet] - 10https://gerrit.wikimedia.org/r/784736 (https://phabricator.wikimedia.org/T304881) [17:16:26] (03PS1) 10Andrew Bogott: Make new hosts cloudservices200[4,5] into cloudservices nodes [puppet] - 10https://gerrit.wikimedia.org/r/784737 (https://phabricator.wikimedia.org/T304881) [17:16:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:16:29] (03PS1) 10Andrew Bogott: Make new cloudweb2002-dev node into a cloudweb node [puppet] - 10https://gerrit.wikimedia.org/r/784738 (https://phabricator.wikimedia.org/T304881) [17:16:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:16:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:16:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:16:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:16:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:20] (03CR) 10jerkins-bot: [V: 04-1] Make new hosts cloudservices200[4,5] into cloudservices nodes [puppet] - 
10https://gerrit.wikimedia.org/r/784737 (https://phabricator.wikimedia.org/T304881) (owner: 10Andrew Bogott) [17:17:39] (03CR) 10jerkins-bot: [V: 04-1] Make new cloudweb2002-dev node into a cloudweb node [puppet] - 10https://gerrit.wikimedia.org/r/784738 (https://phabricator.wikimedia.org/T304881) (owner: 10Andrew Bogott) [17:18:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P25783 and previous config saved to /var/cache/conftool/dbconfig/20220420-171759-ladsgroup.json [17:18:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:45] (03CR) 10Andrew Bogott: [C: 03+2] Make cloudcephmon200[5,6] into cloudcephmon nodes. [puppet] - 10https://gerrit.wikimedia.org/r/784736 (https://phabricator.wikimedia.org/T304881) (owner: 10Andrew Bogott) [17:22:57] (03PS2) 10Andrew Bogott: Make new hosts cloudservices200[4,5] into cloudservices nodes [puppet] - 10https://gerrit.wikimedia.org/r/784737 (https://phabricator.wikimedia.org/T304881) [17:22:59] (03PS2) 10Andrew Bogott: Make new cloudweb2002-dev node into a cloudweb node [puppet] - 10https://gerrit.wikimedia.org/r/784738 (https://phabricator.wikimedia.org/T304881) [17:23:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P25784 and previous config saved to /var/cache/conftool/dbconfig/20220420-172309-ladsgroup.json [17:23:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:29] (03CR) 10jerkins-bot: [V: 04-1] Make new hosts cloudservices200[4,5] into cloudservices nodes [puppet] - 10https://gerrit.wikimedia.org/r/784737 (https://phabricator.wikimedia.org/T304881) (owner: 10Andrew Bogott) [17:23:48] (03CR) 10jerkins-bot: [V: 04-1] Make new cloudweb2002-dev node into a cloudweb node [puppet] - 10https://gerrit.wikimedia.org/r/784738 (https://phabricator.wikimedia.org/T304881) (owner: 
10Andrew Bogott) [17:25:54] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops, 10GitLab (Infrastructure): Q3:(Need By: TBD) rack/setup/install gitlab100[3|4] and gitlab-runner100[2|3|4] - https://phabricator.wikimedia.org/T301177 (10Dzahn) Tyler, Brennen, added you here per our meeting today. So that you can see status of the physical hos... [17:26:02] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host parse1001.mgmt.eqiad.wmnet with reboot policy FORCED [17:26:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:19] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on es1028.eqiad.wmnet with reason: Rebooting for T303174 [17:26:21] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on es1028.eqiad.wmnet with reason: Rebooting for T303174 [17:26:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:38] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on es1025.eqiad.wmnet with reason: Rebooting for T303174 [17:26:40] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on es1025.eqiad.wmnet with reason: Rebooting for T303174 [17:26:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:27:01] (03PS3) 10Andrew Bogott: Make new hosts cloudservices200[4,5] into cloudservices nodes [puppet] - 10https://gerrit.wikimedia.org/r/784737 (https://phabricator.wikimedia.org/T304881) [17:27:03] (03PS3) 10Andrew Bogott: Make new cloudweb2002-dev node into a cloudweb node [puppet] - 10https://gerrit.wikimedia.org/r/784738 (https://phabricator.wikimedia.org/T304881) [17:27:05] (03PS1) 10Andrew Bogott: Add hiera settings for cloudcephmon200[5,6] [puppet] - 
10https://gerrit.wikimedia.org/r/784739 (https://phabricator.wikimedia.org/T304881) [17:27:25] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host parse1012.mgmt.eqiad.wmnet with reboot policy FORCED [17:27:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:27:52] (03CR) 10jerkins-bot: [V: 04-1] Make new hosts cloudservices200[4,5] into cloudservices nodes [puppet] - 10https://gerrit.wikimedia.org/r/784737 (https://phabricator.wikimedia.org/T304881) (owner: 10Andrew Bogott) [17:28:15] (03CR) 10jerkins-bot: [V: 04-1] Make new cloudweb2002-dev node into a cloudweb node [puppet] - 10https://gerrit.wikimedia.org/r/784738 (https://phabricator.wikimedia.org/T304881) (owner: 10Andrew Bogott) [17:29:21] (03CR) 10Andrew Bogott: [C: 03+2] Add hiera settings for cloudcephmon200[5,6] [puppet] - 10https://gerrit.wikimedia.org/r/784739 (https://phabricator.wikimedia.org/T304881) (owner: 10Andrew Bogott) [17:30:04] !log kormat@cumin1001 dbctl commit (dc=all): 'es1028 (re)pooling @ 25%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25785 and previous config saved to /var/cache/conftool/dbconfig/20220420-173004-kormat.json [17:30:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:48] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host parse1008.mgmt.eqiad.wmnet with reboot policy FORCED [17:30:51] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host parse1005.mgmt.eqiad.wmnet with reboot policy FORCED [17:30:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:54] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host parse1011.mgmt.eqiad.wmnet with reboot policy FORCED [17:30:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:55] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision 
(exit_code=0) for host parse1009.mgmt.eqiad.wmnet with reboot policy FORCED [17:30:55] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host parse1002.mgmt.eqiad.wmnet with reboot policy FORCED [17:30:56] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host parse1006.mgmt.eqiad.wmnet with reboot policy FORCED [17:30:56] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host parse1010.mgmt.eqiad.wmnet with reboot policy FORCED [17:30:56] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host parse1004.mgmt.eqiad.wmnet with reboot policy FORCED [17:30:56] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host parse1007.mgmt.eqiad.wmnet with reboot policy FORCED [17:30:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:12] !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on gitlab-runner2001.codfw.wmnet with reason: reimage [17:31:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:15] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on gitlab-runner2001.codfw.wmnet with reason: reimage [17:31:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:40] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) 
for host parse1003.mgmt.eqiad.wmnet with reboot policy FORCED [17:31:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:32:08] !log kormat@cumin1001 dbctl commit (dc=all): 'es1025 (re)pooling @ 25%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25786 and previous config saved to /var/cache/conftool/dbconfig/20220420-173207-kormat.json [17:32:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T298565)', diff saved to https://phabricator.wikimedia.org/P25787 and previous config saved to /var/cache/conftool/dbconfig/20220420-173304-ladsgroup.json [17:33:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:10] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [17:33:47] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host parse1013.mgmt.eqiad.wmnet with reboot policy FORCED [17:33:47] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host parse1017.mgmt.eqiad.wmnet with reboot policy FORCED [17:33:47] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host parse1015.mgmt.eqiad.wmnet with reboot policy FORCED [17:33:47] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host parse1014.mgmt.eqiad.wmnet with reboot policy FORCED [17:33:47] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host parse1020.mgmt.eqiad.wmnet with reboot policy FORCED [17:33:47] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host parse1022.mgmt.eqiad.wmnet with reboot policy FORCED [17:33:47] !log cmjohnson@cumin1001 START - Cookbook 
sre.hosts.provision for host parse1021.mgmt.eqiad.wmnet with reboot policy FORCED [17:33:48] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host parse1016.mgmt.eqiad.wmnet with reboot policy FORCED [17:33:48] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host parse1018.mgmt.eqiad.wmnet with reboot policy FORCED [17:33:49] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host parse1019.mgmt.eqiad.wmnet with reboot policy FORCED [17:33:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:54] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1161.eqiad.wmnet with reason: Maintenance [17:33:55] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1161.eqiad.wmnet with reason: Maintenance [17:33:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:57] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [17:34:00] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [17:34:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1161 (T298565)', diff saved to https://phabricator.wikimedia.org/P25788 and previous config saved to /var/cache/conftool/dbconfig/20220420-173405-ladsgroup.json [17:34:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:16] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:43] (03PS1) 10Dzahn: DHCP: make protected gitlab-runners use bullseye installer [puppet] - 10https://gerrit.wikimedia.org/r/784741 (https://phabricator.wikimedia.org/T297659) [17:34:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:56] (03CR) 10Dzahn: [C: 03+2] DHCP: make protected gitlab-runners use bullseye installer [puppet] - 10https://gerrit.wikimedia.org/r/784741 (https://phabricator.wikimedia.org/T297659) (owner: 10Dzahn) [17:36:09] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for AAssaf - https://phabricator.wikimedia.org/T306437 (10AAssaf-WMF) Thanks! 
[17:38:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P25789 and previous config saved to /var/cache/conftool/dbconfig/20220420-173814-ladsgroup.json [17:38:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:39:19] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host parse1019.mgmt.eqiad.wmnet with reboot policy FORCED [17:39:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:39:23] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host parse1015.mgmt.eqiad.wmnet with reboot policy FORCED [17:39:25] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host parse1020.mgmt.eqiad.wmnet with reboot policy FORCED [17:39:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:39:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:19] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host parse1012.mgmt.eqiad.wmnet with reboot policy FORCED [17:40:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:38] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host parse1023.mgmt.eqiad.wmnet with reboot policy FORCED [17:40:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:44:06] SRE, ops-codfw: mc2031.mgmt looks down from icinga's perspective - https://phabricator.wikimedia.org/T306438 (Papaul) Open→Resolved @elukey all good it was a cable problem [17:45:08] !log kormat@cumin1001 dbctl commit (dc=all): 'es1028 (re)pooling @ 50%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25790 and previous config saved to /var/cache/conftool/dbconfig/20220420-174508-kormat.json [17:45:11] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:45:12] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host parse1015.mgmt.eqiad.wmnet with reboot policy FORCED [17:45:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:16] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host parse1016.mgmt.eqiad.wmnet with reboot policy FORCED [17:46:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:20] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host parse1022.mgmt.eqiad.wmnet with reboot policy FORCED [17:46:23] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host parse1013.mgmt.eqiad.wmnet with reboot policy FORCED [17:46:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:24] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host parse1019.mgmt.eqiad.wmnet with reboot policy FORCED [17:46:26] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host parse1021.mgmt.eqiad.wmnet with reboot policy FORCED [17:46:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:38] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host parse1014.mgmt.eqiad.wmnet with reboot policy FORCED [17:46:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:41] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host parse1017.mgmt.eqiad.wmnet with reboot policy FORCED [17:46:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:52] RECOVERY - Host mc2031.mgmt is UP: PING OK - Packet 
loss = 0%, RTA = 33.38 ms [17:47:12] !log kormat@cumin1001 dbctl commit (dc=all): 'es1025 (re)pooling @ 50%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25791 and previous config saved to /var/cache/conftool/dbconfig/20220420-174711-kormat.json [17:47:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:15] (03PS1) 10Vivian Rook: Add Vivian Rook to icinga [puppet] - 10https://gerrit.wikimedia.org/r/784742 [17:47:20] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host parse1018.mgmt.eqiad.wmnet with reboot policy FORCED [17:47:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [17:49:57] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host parse1024.mgmt.eqiad.wmnet with reboot policy FORCED [17:50:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:50:06] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host parse1019.mgmt.eqiad.wmnet with reboot policy FORCED [17:50:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:18] (03PS28) 10Vivian Rook: pcc commit do not merge [puppet] - 10https://gerrit.wikimedia.org/r/782107 [17:52:04] (03CR) 10jerkins-bot: [V: 04-1] pcc commit do not merge [puppet] - 10https://gerrit.wikimedia.org/r/782107 (owner: 10Vivian Rook) [17:53:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T298565)', diff saved to https://phabricator.wikimedia.org/P25792 and previous config saved to /var/cache/conftool/dbconfig/20220420-175319-ladsgroup.json [17:53:21] !log 
ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1157.eqiad.wmnet with reason: Maintenance [17:53:22] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1157.eqiad.wmnet with reason: Maintenance [17:53:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:25] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [17:53:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1157 (T298565)', diff saved to https://phabricator.wikimedia.org/P25793 and previous config saved to /var/cache/conftool/dbconfig/20220420-175327-ladsgroup.json [17:53:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:48] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host parse1023.mgmt.eqiad.wmnet with reboot policy FORCED [17:53:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:10] (03PS1) 10Gergő Tisza: [beta] Update $wgGECampaignPattern [mediawiki-config] - 10https://gerrit.wikimedia.org/r/784745 (https://phabricator.wikimedia.org/T303785) [18:00:04] jeena and brennen: #bothumor My software never has bugs. It just develops random features. Rise for Train log triage with CPT. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220420T1800). [18:00:04] jeena and brennen: May I have your attention please! MediaWiki train - Utc-7 Version. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220420T1800) [18:00:12] !log kormat@cumin1001 dbctl commit (dc=all): 'es1028 (re)pooling @ 75%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25794 and previous config saved to /var/cache/conftool/dbconfig/20220420-180012-kormat.json [18:00:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:02:17] !log kormat@cumin1001 dbctl commit (dc=all): 'es1025 (re)pooling @ 75%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25795 and previous config saved to /var/cache/conftool/dbconfig/20220420-180215-kormat.json [18:02:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:02:25] (03PS1) 10Jeena Huneidi: group1 wikis to 1.39.0-wmf.8 refs T305214 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/784746 [18:02:27] (03CR) 10Jeena Huneidi: [C: 03+2] group1 wikis to 1.39.0-wmf.8 refs T305214 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/784746 (owner: 10Jeena Huneidi) [18:02:32] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host parse1024.mgmt.eqiad.wmnet with reboot policy FORCED [18:02:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:02:47] (03CR) 10Gergő Tisza: [C: 03+2] [beta] Update $wgGECampaignPattern [mediawiki-config] - 10https://gerrit.wikimedia.org/r/784745 (https://phabricator.wikimedia.org/T303785) (owner: 10Gergő Tisza) [18:02:49] (03CR) 10CDanis: [C: 03+2] Enable profile::auto_restarts::service for klaxon gunicorn webapp [puppet] - 10https://gerrit.wikimedia.org/r/767516 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [18:03:05] (03Merged) 10jenkins-bot: group1 wikis to 1.39.0-wmf.8 refs T305214 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/784746 (owner: 10Jeena Huneidi) [18:03:26] (03Merged) 10jenkins-bot: [beta] Update $wgGECampaignPattern [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/784745 (https://phabricator.wikimedia.org/T303785) (owner: 10Gergő Tisza) [18:03:30] (03Abandoned) 10CDanis: upload VCL: Only apply requestctl rules to external clients [puppet] - 10https://gerrit.wikimedia.org/r/779064 (owner: 10CDanis) [18:04:25] !log jhuneidi@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.39.0-wmf.8 refs T305214 [18:04:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:30] T305214: 1.39.0-wmf.8 deployment blockers - https://phabricator.wikimedia.org/T305214 [18:05:17] !log jhuneidi@deploy1002 Synchronized php: group1 wikis to 1.39.0-wmf.8 refs T305214 (duration: 00m 51s) [18:05:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:06] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q3:(Need By: TBD) rack/setup/install parse100[01-24] - https://phabricator.wikimedia.org/T299573 (10Cmjohnson) [18:10:09] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [18:10:12] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [18:10:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:13] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [18:10:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:17] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [18:10:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:16] !log kormat@cumin1001 dbctl commit (dc=all): 'es1028 (re)pooling @ 100%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25796 and previous config saved to /var/cache/conftool/dbconfig/20220420-181515-kormat.json [18:15:19] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:50] 10SRE, 10ops-codfw, 10DC-Ops, 10GitLab (Infrastructure): Q3:(Need By: TBD) rack/setup/install gitlab200[2|3] and gitlab-runner200[2|3|4] - https://phabricator.wikimedia.org/T301183 (10Papaul) [18:17:21] !log kormat@cumin1001 dbctl commit (dc=all): 'es1025 (re)pooling @ 100%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25797 and previous config saved to /var/cache/conftool/dbconfig/20220420-181720-kormat.json [18:17:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:08] 10SRE, 10ops-codfw, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs2001-aqs2012 - https://phabricator.wikimedia.org/T305568 (10Papaul) [18:31:41] 10SRE, 10LDAP-Access-Requests, 10SRE Observability (FY2021/2022-Q4): Grant Access to ldap/wmf for "Mary Yang" - https://phabricator.wikimedia.org/T306225 (10dr0ptp4kt) Thanks @Volans, thanks @Jdforrester-WMF. [18:32:55] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [18:34:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T298565)', diff saved to https://phabricator.wikimedia.org/P25798 and previous config saved to /var/cache/conftool/dbconfig/20220420-183419-ladsgroup.json [18:34:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:34:26] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [18:36:07] !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on 
gitlab-runner2001.codfw.wmnet with reason: reimage [18:36:09] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on gitlab-runner2001.codfw.wmnet with reason: reimage [18:36:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:36:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:16] !log reimaging gitlab-runner2021.codfw.wmnet [18:39:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:45] (JobUnavailable) firing: (2) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:49:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P25799 and previous config saved to /var/cache/conftool/dbconfig/20220420-184925-ladsgroup.json [18:49:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T298565)', diff saved to https://phabricator.wikimedia.org/P25800 and previous config saved to /var/cache/conftool/dbconfig/20220420-185341-ladsgroup.json [18:53:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:47] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [19:01:55] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - 
https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [19:04:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P25801 and previous config saved to /var/cache/conftool/dbconfig/20220420-190429-ladsgroup.json [19:04:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P25802 and previous config saved to /var/cache/conftool/dbconfig/20220420-190846-ladsgroup.json [19:08:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:19:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T298565)', diff saved to https://phabricator.wikimedia.org/P25803 and previous config saved to /var/cache/conftool/dbconfig/20220420-191934-ladsgroup.json [19:19:36] !log puppetmaster - cleaning cert for gitlab-runner2001, signing new request [19:19:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:19:41] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [19:19:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:19] (03PS1) 10Hashar: devtools: update fqdn for deploy-1004 instance [puppet] - 10https://gerrit.wikimedia.org/r/784750 (https://phabricator.wikimedia.org/T299997) [19:20:23] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance [19:20:25] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 
6:00:00 on db1144.eqiad.wmnet with reason: Maintenance [19:20:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3315 (T298565)', diff saved to https://phabricator.wikimedia.org/P25804 and previous config saved to /var/cache/conftool/dbconfig/20220420-192029-ladsgroup.json [19:20:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:21:56] (03CR) 10Hashar: [V: 03+1] "I have cherry picked the change on puppetmaster-1001.devtools and that "fixed" the puppet agent on phabricator-stage-1001 :)" [puppet] - 10https://gerrit.wikimedia.org/r/784750 (https://phabricator.wikimedia.org/T299997) (owner: 10Hashar) [19:23:40] (03CR) 10Dzahn: [C: 03+2] "oooh.. THIS! yea, that changed in cloud VPS. at some point the DNS entries vanished, ran into this in some other context too. thanks" [puppet] - 10https://gerrit.wikimedia.org/r/784750 (https://phabricator.wikimedia.org/T299997) (owner: 10Hashar) [19:23:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P25805 and previous config saved to /var/cache/conftool/dbconfig/20220420-192354-ladsgroup.json [19:23:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:27:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T298565)', diff saved to https://phabricator.wikimedia.org/P25806 and previous config saved to /var/cache/conftool/dbconfig/20220420-192717-ladsgroup.json [19:27:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:27:23] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, 
user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [19:29:36] (03PS1) 10Dzahn: deployment-prep: replace eqiad.wmflabs with eqiad1.wikimedia.cloud [puppet] - 10https://gerrit.wikimedia.org/r/784753 [19:29:51] (03CR) 10Dzahn: "this made me do https://gerrit.wikimedia.org/r/c/operations/puppet/+/784753/" [puppet] - 10https://gerrit.wikimedia.org/r/784750 (https://phabricator.wikimedia.org/T299997) (owner: 10Hashar) [19:36:26] PROBLEM - Check systemd state on gitlab-runner2001 is CRITICAL: CRITICAL - degraded: The following units failed: docker-gc.service,docker-resource-monitor.service,ifup@ens14.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:39:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T298565)', diff saved to https://phabricator.wikimedia.org/P25807 and previous config saved to /var/cache/conftool/dbconfig/20220420-193859-ladsgroup.json [19:39:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:39:06] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [19:39:30] (03PS1) 10Dzahn: cloudinfra: replace eqiad.wmflabs with eqiad1.wikimedia.cloud [puppet] - 10https://gerrit.wikimedia.org/r/784755 [19:42:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P25808 and previous config saved to /var/cache/conftool/dbconfig/20220420-194222-ladsgroup.json [19:42:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:48:50] !log gmodena@deploy1002 Started deploy [airflow-dags/research@b029f10]: (no justification provided) [19:48:53] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:50:02] !log gmodena@deploy1002 Finished deploy [airflow-dags/research@b029f10]: (no justification provided) (duration: 01m 12s) [19:50:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:56:00] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1175.eqiad.wmnet with reason: Maintenance [19:56:01] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1175.eqiad.wmnet with reason: Maintenance [19:56:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:56:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1175 (T298565)', diff saved to https://phabricator.wikimedia.org/P25809 and previous config saved to /var/cache/conftool/dbconfig/20220420-195606-ladsgroup.json [19:56:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:56:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:56:12] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [19:57:26] (03PS1) 104nn1l2: Revert "fawiki: Change logo for 900K milestone" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/784708 [19:57:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P25810 and previous config saved to /var/cache/conftool/dbconfig/20220420-195727-ladsgroup.json [19:57:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:57:34] (03PS1) 104nn1l2: Revert "fawiki: Change logo for 900K milestone" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/784709 [19:59:45] (03PS1) 104nn1l2: Revert 
"fawiki: Change wordmark & tagline (new Vector) and logo (legacy Vector)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/784710 [20:00:05] RoanKattouw and Urbanecm: It is that lovely time of the day again! You are hereby commanded to deploy UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220420T2000). [20:00:05] nn1l2 and nn1l2: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:53] hi [20:02:05] (03Abandoned) 104nn1l2: Revert "fawiki: Change logo for 900K milestone" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/784709 (owner: 104nn1l2) [20:05:19] Any deployer for this window? [20:06:54] urbanecm: hi, can you deploy today? [20:09:12] RoanKattouw: hi, what about you? [20:09:39] jouncebot: now [20:09:39] For the next 0 hour(s) and 50 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220420T2000) [20:10:10] hi nn1l2, I can do the backport but do you have anyone to CR your change? [20:10:21] !log gmodena@deploy1002 Started deploy [airflow-dags/research@b029f10]: (no justification provided) [20:10:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:27] !log gmodena@deploy1002 Finished deploy [airflow-dags/research@b029f10]: (no justification provided) (duration: 00m 06s) [20:10:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:38] what do you mean by CR? [20:10:52] review [20:11:10] since I don't really know anything about it [20:11:58] Using Mwdebug1001 extension? 
[20:12:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T298565)', diff saved to https://phabricator.wikimedia.org/P25811 and previous config saved to /var/cache/conftool/dbconfig/20220420-201217-ladsgroup.json [20:12:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:12:22] like if someone could sign off on the change (do a code review +1) [20:12:23] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [20:12:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T298565)', diff saved to https://phabricator.wikimedia.org/P25812 and previous config saved to /var/cache/conftool/dbconfig/20220420-201232-ladsgroup.json [20:12:34] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance [20:12:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance [20:12:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:12:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3315 (T298565)', diff saved to https://phabricator.wikimedia.org/P25813 and previous config saved to /var/cache/conftool/dbconfig/20220420-201240-ladsgroup.json [20:13:25] I also see it has a merge conflict so it needs to be rebased [20:13:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:13:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:13:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:06] It's just a revert to the original configuration. 
I will solve the rebase myself. Give me some minutes. [20:14:41] (03PS2) 104nn1l2: Revert "fawiki: Change logo for 900K milestone" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/784708 [20:17:10] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/784708 rebased [20:17:43] (03PS2) 104nn1l2: Revert "fawiki: Change wordmark & tagline (new Vector) and logo (legacy Vector)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/784710 [20:19:22] thanks I'll +2 now [20:20:45] (03CR) 10Jeena Huneidi: [C: 03+2] Revert "fawiki: Change logo for 900K milestone" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/784708 (owner: 104nn1l2) [20:20:54] (03CR) 10Jeena Huneidi: [C: 03+2] "backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/784708 (owner: 104nn1l2) [20:21:25] (03Merged) 10jenkins-bot: Revert "fawiki: Change logo for 900K milestone" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/784708 (owner: 104nn1l2) [20:23:03] (03CR) 10Jeena Huneidi: [C: 03+2] "backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/784710 (owner: 104nn1l2) [20:23:15] (03PS5) 10Juan90264: Create 'uploader' group for viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/783918 (https://phabricator.wikimedia.org/T303577) [20:23:33] (03PS6) 10Juan90264: Create 'uploader' group for viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/783918 (https://phabricator.wikimedia.org/T303577) [20:25:01] Hello [20:25:30] hello [20:26:27] sorry nn1l2 can you rebase https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/784710/ again? 
[20:26:33] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:26:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:26:36] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:26:37] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:26:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:26:41] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:26:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:26:45] (03PS3) 104nn1l2: Revert "fawiki: Change wordmark & tagline (new Vector) and logo (legacy Vector)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/784710 [20:26:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:26:53] done [20:27:00] (03CR) 10BryanDavis: [C: 03+1] "Seems fine to me. The only thing that I personally think would be better is moving all of this hiera config into Horizon where it can be u" [puppet] - 10https://gerrit.wikimedia.org/r/784753 (owner: 10Dzahn) [20:27:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P25814 and previous config saved to /var/cache/conftool/dbconfig/20220420-202722-ladsgroup.json [20:27:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:56] (03PS1) 10RLazarus: varnish: Rename public_clouds.json to ipblock_cloud.json [puppet] - 10https://gerrit.wikimedia.org/r/784761 (https://phabricator.wikimedia.org/T305581) [20:29:02] (03CR) 10Jeena Huneidi: [C: 03+2] Revert "fawiki: Change wordmark & tagline (new Vector) and logo (legacy Vector)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/784710 (owner: 104nn1l2) [20:29:25] (03PS1) 10Ladsgroup: Add fix_img_major_mime_null_T306560.py [software/schema-changes] - 
10https://gerrit.wikimedia.org/r/784762 (https://phabricator.wikimedia.org/T306560) [20:29:39] (03CR) 10Dzahn: deployment-prep: replace eqiad.wmflabs with eqiad1.wikimedia.cloud (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/784753 (owner: 10Dzahn) [20:29:48] (03CR) 10jerkins-bot: [V: 04-1] Add fix_img_major_mime_null_T306560.py [software/schema-changes] - 10https://gerrit.wikimedia.org/r/784762 (https://phabricator.wikimedia.org/T306560) (owner: 10Ladsgroup) [20:29:53] (03Merged) 10jenkins-bot: Revert "fawiki: Change wordmark & tagline (new Vector) and logo (legacy Vector)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/784710 (owner: 104nn1l2) [20:30:26] (03CR) 10RLazarus: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34920/console" [puppet] - 10https://gerrit.wikimedia.org/r/784761 (https://phabricator.wikimedia.org/T305581) (owner: 10RLazarus) [20:31:16] (03CR) 10Dzahn: "@Bryan would you say the same applies to 'cloudinfra' project or that is a different case?" [puppet] - 10https://gerrit.wikimedia.org/r/784753 (owner: 10Dzahn) [20:31:55] (03PS2) 10Jforrester: [Beta Cluster] LabsServices: Move to buster restbase host [mediawiki-config] - 10https://gerrit.wikimedia.org/r/766602 (https://phabricator.wikimedia.org/T295375) (owner: 10Hnowlan) [20:31:59] (03PS3) 10Jforrester: [Beta Cluster] LabsServices: Move to buster restbase host [mediawiki-config] - 10https://gerrit.wikimedia.org/r/766602 (https://phabricator.wikimedia.org/T295375) (owner: 10Hnowlan) [20:32:01] (03PS2) 10Ladsgroup: Add fix_img_major_mime_null_T306560.py [software/schema-changes] - 10https://gerrit.wikimedia.org/r/784762 (https://phabricator.wikimedia.org/T306560) [20:32:10] (03CR) 10Jforrester: "Is this good to deploy?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/766602 (https://phabricator.wikimedia.org/T295375) (owner: 10Hnowlan) [20:32:40] nn1l2: you should be able to check it out on mwdebug1001 now [20:32:56] (03PS4) 10Jforrester: [Beta Cluster] LabsServices: Use deployment-graphite01 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747494 (https://phabricator.wikimedia.org/T241285) (owner: 10Majavah) [20:33:02] (03CR) 10Jforrester: "Is this good to deploy?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747494 (https://phabricator.wikimedia.org/T241285) (owner: 10Majavah) [20:33:32] let me know if everything looks good [20:34:00] : LGTM, please sync [20:34:13] 👍 [20:36:07] !log gitlab-runner2001 - mkdir /home/gitlab-runner (was: PANIC: mkdir /home/gitlab-runner: permission denied and other issues, trying if it's just the missing directory or more) T297659 [20:36:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:36:13] T297659: upgrade gitlab-runners to bullseye - https://phabricator.wikimedia.org/T297659 [20:36:15] (03PS2) 10Jforrester: Remove references to the 'electron' service [mediawiki-config] - 10https://gerrit.wikimedia.org/r/634935 (owner: 10Giuseppe Lavagetto) [20:36:18] (03CR) 10Jforrester: "> Patch Set 1: Code-Review-1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/634935 (owner: 10Giuseppe Lavagetto) [20:36:40] (03CR) 10Ssingh: [V: 03+1] P:wikidough: add a check to ensure service has been restarted (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/784697 (owner: 10Ssingh) [20:36:48] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:36:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:36:51] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:36:52] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:36:54] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:36:56] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:36:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:37:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:37:44] !log jhuneidi@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:784708|Revert "fawiki: Change logo for 900K milestone"]] [[gerrit:784710|Revert "fawiki: Change wordmark & tagline (new Vector) and logo (legacy Vector)"]] (duration: 00m 57s) [20:37:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:38:06] Don't forget me [20:38:17] urbanecm [20:38:37] I am doing the backports today [20:38:43] !log jhuneidi@deploy1002 Synchronized logos/config.yaml: Config: [[gerrit:784710|Revert "fawiki: Change wordmark & tagline (new Vector) and logo (legacy Vector)"]] (duration: 00m 51s) [20:38:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:38:47] let me finish with this sync and I'll take a look at yours [20:38:52] I'm not deploying today Juan_90264. 
and for the future, I recommend showing up on time -- it's easier to not forget that way ;) [20:39:02] and thanks jeena for doing the backports today [20:39:08] np [20:40:13] !log jhuneidi@deploy1002 Synchronized wmf-config/logos.php: Config: [[gerrit:784710|Revert "fawiki: Change wordmark & tagline (new Vector) and logo (legacy Vector)"]] (duration: 00m 50s) [20:40:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:40:37] Okay [20:42:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P25815 and previous config saved to /var/cache/conftool/dbconfig/20220420-204227-ladsgroup.json [20:42:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:44:58] !log jhuneidi@deploy1002 Synchronized static/images/mobile/copyright/: Config: [[gerrit:784708|Revert "fawiki: Change logo for 900K milestone"]] (duration: 00m 49s) [20:45:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:46:12] !log jhuneidi@deploy1002 Synchronized static/images/project-logos/: Config: [[gerrit:784710|Revert "fawiki: Change wordmark & tagline (new Vector) and logo (legacy Vector)"]] (duration: 00m 51s) [20:46:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:46:15] (03PS2) 10RLazarus: varnish: Rename public_clouds.json to ipblock_cloud.json [puppet] - 10https://gerrit.wikimedia.org/r/784761 (https://phabricator.wikimedia.org/T305581) [20:46:23] nn1l2: should be all synced now [20:46:48] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcephmon2005-dev.codfw.wmnet with OS buster [20:46:49] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcephmon2006-dev.codfw.wmnet with OS buster [20:46:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:46:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[20:46:57] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): Q3:(Need By: TBD) rack/setup/install 7 wmcs hosts - https://phabricator.wikimedia.org/T304881 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudcephmon2005-dev.codfw... [20:47:00] Yes, Thanks! [20:47:02] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): Q3:(Need By: TBD) rack/setup/install 7 wmcs hosts - https://phabricator.wikimedia.org/T304881 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudcephmon2006-dev.codfw... [20:47:32] you're welcome! [20:48:10] Juan_90264: is this your patch? https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/783918 [20:48:41] (03CR) 10BryanDavis: [C: 03+1] deployment-prep: replace eqiad.wmflabs with eqiad1.wikimedia.cloud (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/784753 (owner: 10Dzahn) [20:50:15] (03CR) 10Jforrester: [C: 03+1] deployment-prep: replace eqiad.wmflabs with eqiad1.wikimedia.cloud [puppet] - 10https://gerrit.wikimedia.org/r/784753 (owner: 10Dzahn) [20:50:52] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /_info (retrieve service info) is CRITICAL: Test retrieve service info returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid [20:51:05] jeena: Yes [20:51:24] can you give it a rebase? 
[20:52:04] (03PS3) 10Ssingh: P:wikidough: add a check to ensure service has been restarted [puppet] - 10https://gerrit.wikimedia.org/r/784697 [20:52:15] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [20:52:23] (03PS4) 10Andrew Bogott: Make new hosts cloudservices200[4,5] into cloudservices nodes [puppet] - 10https://gerrit.wikimedia.org/r/784737 (https://phabricator.wikimedia.org/T304881) [20:52:25] (03PS4) 10Andrew Bogott: Make new cloudweb2002-dev node into a cloudweb node [puppet] - 10https://gerrit.wikimedia.org/r/784738 (https://phabricator.wikimedia.org/T304881) [20:53:29] (03CR) 10Andrew Bogott: [C: 03+2] Make new hosts cloudservices200[4,5] into cloudservices nodes [puppet] - 10https://gerrit.wikimedia.org/r/784737 (https://phabricator.wikimedia.org/T304881) (owner: 10Andrew Bogott) [20:55:47] (03CR) 10Ssingh: "(Updated and added a configurable critical time argument set by default to 24 hours after which the alert is raised from WARNING to CRITIC" [puppet] - 10https://gerrit.wikimedia.org/r/784697 (owner: 10Ssingh) [20:56:46] (03PS7) 10Juan90264: Create 'uploader' group for viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/783918 (https://phabricator.wikimedia.org/T303577) [20:57:17] Jeena: Rebase done [20:57:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T298565)', diff saved to https://phabricator.wikimedia.org/P25816 and previous config saved to /var/cache/conftool/dbconfig/20220420-205732-ladsgroup.json [20:57:34] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [20:57:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [20:57:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:57:39] T298565: Fix mismatching field type of user 
table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [20:57:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:57:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:57:49] (03CR) 10Jeena Huneidi: [C: 03+2] "backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/783918 (https://phabricator.wikimedia.org/T303577) (owner: 10Juan90264) [20:59:01] (03Merged) 10jenkins-bot: Create 'uploader' group for viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/783918 (https://phabricator.wikimedia.org/T303577) (owner: 10Juan90264) [21:00:31] Juan_90264: pulled to mwdebug1001, please confirm it is working as expected [21:01:38] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephmon2006-dev.codfw.wmnet with reason: host reimage [21:01:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:01:42] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephmon2005-dev.codfw.wmnet with reason: host reimage [21:01:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:01:51] (03PS3) 10RLazarus: varnish: Rename public_clouds.json to ipblock_cloud.json [puppet] - 10https://gerrit.wikimedia.org/r/784761 (https://phabricator.wikimedia.org/T305581) [21:02:14] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:02:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:02:17] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:02:19] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:02:20] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:02:22] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:02:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:02:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:02:44] (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [21:05:06] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephmon2006-dev.codfw.wmnet with reason: host reimage [21:05:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:07:07] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephmon2005-dev.codfw.wmnet with reason: host reimage [21:07:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:07:27] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:07:30] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:07:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:07:31] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:07:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:07:35] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:07:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:07:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:07:52] (03CR) 10Dzahn: "When creating a new trusted runner (reimage) I ran into:" [puppet] - 10https://gerrit.wikimedia.org/r/759254 (https://phabricator.wikimedia.org/T295481) (owner: 
10Jelto) [21:08:27] jouncebot: nowandnext [21:08:27] No deployments scheduled for the next 8 hour(s) and 51 minute(s) [21:08:27] In 8 hour(s) and 51 minute(s): Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220421T0600) [21:08:35] I'm still doing a backport [21:09:12] Juan_90264: were you able to check if your change has applied correctly on mwdebug1001? [21:12:52] heh, I guess that's a no [21:12:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T298565)', diff saved to https://phabricator.wikimedia.org/P25817 and previous config saved to /var/cache/conftool/dbconfig/20220420-211255-ladsgroup.json [21:13:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:13:01] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [21:13:16] 🫥 [21:13:30] that is nothing in my terminal :) [21:13:38] hahaha exactly [21:13:51] well it's called "dotted line face" [21:14:07] anyway I should revert I guess? [21:14:10] +1 [21:14:30] sigh [21:15:08] thanks jeena! 
[21:15:15] np [21:17:08] (03PS1) 10Dzahn: gitlab: temp set gitlab-runner user to root for bootstrapping gitlab-runner2001 [puppet] - 10https://gerrit.wikimedia.org/r/784765 (https://phabricator.wikimedia.org/T297659) [21:20:28] (03PS1) 10Jeena Huneidi: Revert "Create 'uploader' group for viwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/784766 [21:21:19] (03CR) 10Jeena Huneidi: [C: 03+2] "backport cancelled" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/784766 (owner: 10Jeena Huneidi) [21:21:34] Hello jeena [21:21:43] hello, I've just reverted your change [21:22:33] (03Merged) 10jenkins-bot: Revert "Create 'uploader' group for viwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/784766 (owner: 10Jeena Huneidi) [21:22:40] I left the web window, and it ended up disconnecting [21:22:59] mwdebug1001 or 1002? [21:23:05] mwdebug1001 [21:23:09] it should still be there [21:23:24] I will undo the revert and sync if it looks good [21:23:25] Okay [21:23:46] (03CR) 10Dzahn: P:wikidough: add a check to ensure service has been restarted (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/784697 (owner: 10Ssingh) [21:25:58] (03PS4) 10RLazarus: varnish: Rename public_clouds.json to ipblock_cloud.json [puppet] - 10https://gerrit.wikimedia.org/r/784761 (https://phabricator.wikimedia.org/T305581) [21:26:13] (03CR) 10Dzahn: [C: 03+1] "despite my comment I say "just ship it" and test in production if it works as expected" [puppet] - 10https://gerrit.wikimedia.org/r/784697 (owner: 10Ssingh) [21:26:56] (03CR) 10RLazarus: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34924/console" [puppet] - 10https://gerrit.wikimedia.org/r/784761 (https://phabricator.wikimedia.org/T305581) (owner: 10RLazarus) [21:27:48] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:27:51] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply 
[21:27:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:27:52] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:27:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:27:56] jeena: I checked and approved [21:27:56] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:27:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:28:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P25818 and previous config saved to /var/cache/conftool/dbconfig/20220420-212800-ladsgroup.json [21:28:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:28:04] thank you Juan_90264 [21:28:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:28:10] I will let you know when it's synced [21:28:44] (03PS1) 10Jeena Huneidi: Revert "Revert "Create 'uploader' group for viwiki"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/784711 [21:29:30] (03CR) 10Jeena Huneidi: [C: 03+2] "backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/784711 (owner: 10Jeena Huneidi) [21:29:48] (03CR) 10Dzahn: [C: 03+2] gitlab: temp set gitlab-runner user to root for bootstrapping gitlab-runner2001 [puppet] - 10https://gerrit.wikimedia.org/r/784765 (https://phabricator.wikimedia.org/T297659) (owner: 10Dzahn) [21:30:12] (03Merged) 10jenkins-bot: Revert "Revert "Create 'uploader' group for viwiki"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/784711 (owner: 10Jeena Huneidi) [21:30:48] Okay [21:31:09] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1157.eqiad.wmnet with reason: Maintenance [21:31:10] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1157.eqiad.wmnet with reason: Maintenance 
[21:31:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:31:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:31:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1157 (T298565)', diff saved to https://phabricator.wikimedia.org/P25819 and previous config saved to /var/cache/conftool/dbconfig/20220420-213115-ladsgroup.json [21:31:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:31:21] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [21:32:19] !log jhuneidi@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:784711|Revert "Revert "Create 'uploader' group for viwiki""]] (duration: 00m 53s) [21:32:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:32:24] Juan_90264: done [21:32:40] For anyone who needs to know, that concludes today's backports [21:33:02] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:33:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:33:05] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:33:06] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:33:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:33:10] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:33:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:33:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:33:40] Okay, thanks Jeena for deploying! 
[21:33:51] You're welcome [21:34:07] (03CR) 10RLazarus: [V: 03+1] varnish: Rename public_clouds.json to ipblock_cloud.json (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/784761 (https://phabricator.wikimedia.org/T305581) (owner: 10RLazarus) [21:38:03] (03CR) 10Dzahn: "yep, this worked. first puppet run ok now. reverting it to continue" [puppet] - 10https://gerrit.wikimedia.org/r/784765 (https://phabricator.wikimedia.org/T297659) (owner: 10Dzahn) [21:38:13] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:38:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:38:16] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:38:17] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:38:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:38:21] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:38:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:38:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:39:06] (03PS1) 10Dzahn: Revert "gitlab: temp set gitlab-runner user to root for bootstrapping gitlab-runner2001" [puppet] - 10https://gerrit.wikimedia.org/r/784712 [21:39:28] (03CR) 10Dzahn: [C: 03+2] Revert "gitlab: temp set gitlab-runner user to root for bootstrapping gitlab-runner2001" [puppet] - 10https://gerrit.wikimedia.org/r/784712 (owner: 10Dzahn) [21:43:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P25820 and previous config saved to /var/cache/conftool/dbconfig/20220420-214305-ladsgroup.json [21:43:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:47:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - 
https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [21:54:38] (03CR) 10Dzahn: deployment-prep: replace eqiad.wmflabs with eqiad1.wikimedia.cloud (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/784753 (owner: 10Dzahn) [21:56:57] (03CR) 10Dzahn: deployment-prep: replace eqiad.wmflabs with eqiad1.wikimedia.cloud (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/784753 (owner: 10Dzahn) [21:58:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T298565)', diff saved to https://phabricator.wikimedia.org/P25821 and previous config saved to /var/cache/conftool/dbconfig/20220420-215810-ladsgroup.json [21:58:12] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1130.eqiad.wmnet with reason: Maintenance [21:58:14] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1130.eqiad.wmnet with reason: Maintenance [21:58:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:58:17] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [21:58:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1130 (T298565)', diff saved to https://phabricator.wikimedia.org/P25822 and previous config saved to /var/cache/conftool/dbconfig/20220420-215818-ladsgroup.json [21:58:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:58:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:58:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[22:00:18] (03CR) 10Dzahn: "So this would have ben the same thing for cloudinfra but I would not merge this. it's more "FYI"" [puppet] - 10https://gerrit.wikimedia.org/r/784755 (owner: 10Dzahn) [22:00:42] (03Abandoned) 10Dzahn: deployment-prep: replace eqiad.wmflabs with eqiad1.wikimedia.cloud [puppet] - 10https://gerrit.wikimedia.org/r/784753 (owner: 10Dzahn) [22:00:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1130 (T298565)', diff saved to https://phabricator.wikimedia.org/P25823 and previous config saved to /var/cache/conftool/dbconfig/20220420-220048-ladsgroup.json [22:00:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:02:52] (03Abandoned) 10Dzahn: cloudinfra: replace eqiad.wmflabs with eqiad1.wikimedia.cloud [puppet] - 10https://gerrit.wikimedia.org/r/784755 (owner: 10Dzahn) [22:06:49] (03PS5) 10Andrew Bogott: Make new cloudweb2002-dev node into a cloudweb node [puppet] - 10https://gerrit.wikimedia.org/r/784738 (https://phabricator.wikimedia.org/T304881) [22:06:51] (03PS1) 10Andrew Bogott: ceph::mon: actually install mon and mgr packages on mon nodes [puppet] - 10https://gerrit.wikimedia.org/r/784767 [22:07:17] (03CR) 10Dzahn: "after this the puppet errors are back : PANIC: mkdir /home/gitlab-runner: permission denied" [puppet] - 10https://gerrit.wikimedia.org/r/784712 (owner: 10Dzahn) [22:07:40] (03CR) 10jerkins-bot: [V: 04-1] ceph::mon: actually install mon and mgr packages on mon nodes [puppet] - 10https://gerrit.wikimedia.org/r/784767 (owner: 10Andrew Bogott) [22:09:24] (03PS1) 10Dzahn: Revert "Revert "gitlab: temp set gitlab-runner user to root for bootstrapping gitlab-runner2001"" [puppet] - 10https://gerrit.wikimedia.org/r/784713 [22:11:10] (03CR) 10Dzahn: [C: 03+2] Revert "Revert "gitlab: temp set gitlab-runner user to root for bootstrapping gitlab-runner2001"" [puppet] - 10https://gerrit.wikimedia.org/r/784713 (owner: 10Dzahn) [22:13:50] !log andrew@cumin1001 END (PASS) - 
Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephmon2006-dev.codfw.wmnet with OS buster [22:13:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:14:00] SRE, ops-codfw, DC-Ops, Patch-For-Review, cloud-services-team (Hardware): Q3:(Need By: TBD) rack/setup/install 7 wmcs hosts - https://phabricator.wikimedia.org/T304881 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudcephmon2006-dev.codfw.wmn... [22:14:21] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephmon2005-dev.codfw.wmnet with OS buster [22:14:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:14:34] SRE, ops-codfw, DC-Ops, Patch-For-Review, cloud-services-team (Hardware): Q3:(Need By: TBD) rack/setup/install 7 wmcs hosts - https://phabricator.wikimedia.org/T304881 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudcephmon2005-dev.codfw.wmn... 
[22:15:57] (PS6) Andrew Bogott: Make new cloudweb2002-dev node into a cloudweb node [puppet] - https://gerrit.wikimedia.org/r/784738 (https://phabricator.wikimedia.org/T304881) [22:15:59] (PS1) Andrew Bogott: ceph::mon: actually install mon and mgr packages on mon nodes [puppet] - https://gerrit.wikimedia.org/r/784768 [22:31:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T298565)', diff saved to https://phabricator.wikimedia.org/P25824 and previous config saved to /var/cache/conftool/dbconfig/20220420-223129-ladsgroup.json [22:31:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:31:36] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [22:32:55] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [22:38:41] (CR) Dzahn: "so the code is like "if NOT using root THEN create the home dir". 
That means if it's root then things work but the home dir does not get c" [puppet] - https://gerrit.wikimedia.org/r/784712 (owner: Dzahn) [22:46:00] (JobUnavailable) firing: (2) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:46:04] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [22:46:06] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [22:46:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:46:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:46:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P25825 and previous config saved to /var/cache/conftool/dbconfig/20220420-224634-ladsgroup.json [22:46:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:49:00] (PS1) RLazarus: varnish: Allow using netmapper with multiple requestctl ipblocks [puppet] - https://gerrit.wikimedia.org/r/784774 (https://phabricator.wikimedia.org/T305581) [22:49:26] (PS2) RLazarus: varnish: Allow using netmapper with multiple requestctl ipblock types [puppet] - https://gerrit.wikimedia.org/r/784774 (https://phabricator.wikimedia.org/T305581) [22:50:09] (CR) jerkins-bot: [V: -1] varnish: Allow using netmapper with multiple requestctl ipblock types [puppet] - https://gerrit.wikimedia.org/r/784774 (https://phabricator.wikimedia.org/T305581) (owner: RLazarus) [22:50:27] (PS7) Andrew Bogott: Make new cloudweb2002-dev node into a cloudweb node [puppet] - https://gerrit.wikimedia.org/r/784738 
(https://phabricator.wikimedia.org/T304881) [22:50:28] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2123.codfw.wmnet with reason: Maintenance [22:50:29] (PS1) Andrew Bogott: Add host entries for new cloudservices-dev nodes [puppet] - https://gerrit.wikimedia.org/r/784775 [22:50:30] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2123.codfw.wmnet with reason: Maintenance [22:50:31] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 8 hosts with reason: Maintenance [22:50:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:50:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:50:37] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 8 hosts with reason: Maintenance [22:50:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:50:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:51:24] (PS3) RLazarus: varnish: Allow using netmapper with multiple requestctl ipblock types [puppet] - https://gerrit.wikimedia.org/r/784774 (https://phabricator.wikimedia.org/T305581) [22:51:45] (CR) Andrew Bogott: [C: +2] Add host entries for new cloudservices-dev nodes [puppet] - https://gerrit.wikimedia.org/r/784775 (owner: Andrew Bogott) [22:52:17] (CR) jerkins-bot: [V: -1] varnish: Allow using netmapper with multiple requestctl ipblock types [puppet] - https://gerrit.wikimedia.org/r/784774 (https://phabricator.wikimedia.org/T305581) (owner: RLazarus) [22:52:21] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance [22:52:22] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance [22:52:24] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:52:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:53:07] (CR) Andrew Bogott: [C: +2] ceph::mon: actually install mon and mgr packages on mon nodes [puppet] - https://gerrit.wikimedia.org/r/784768 (owner: Andrew Bogott) [22:56:04] (PS4) RLazarus: varnish: Allow using netmapper with multiple requestctl ipblock types [puppet] - https://gerrit.wikimedia.org/r/784774 (https://phabricator.wikimedia.org/T305581) [22:56:37] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1110.eqiad.wmnet with reason: Maintenance [22:56:38] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1110.eqiad.wmnet with reason: Maintenance [22:56:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:56:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:56:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1110 (T298565)', diff saved to https://phabricator.wikimedia.org/P25826 and previous config saved to /var/cache/conftool/dbconfig/20220420-225643-ladsgroup.json [22:56:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:56:53] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [23:01:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P25827 and previous config saved to /var/cache/conftool/dbconfig/20220420-230140-ladsgroup.json [23:01:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:01:55] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - 
https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [23:03:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T298565)', diff saved to https://phabricator.wikimedia.org/P25828 and previous config saved to /var/cache/conftool/dbconfig/20220420-230303-ladsgroup.json [23:03:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:03:09] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [23:06:12] (CR) RLazarus: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34928/console" [puppet] - https://gerrit.wikimedia.org/r/784774 (https://phabricator.wikimedia.org/T305581) (owner: RLazarus) [23:06:39] PROBLEM - Host mr1-eqiad.oob is DOWN: PING CRITICAL - Packet loss = 100% [23:07:03] PROBLEM - Host mr1-eqiad.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [23:10:06] (PS5) RLazarus: varnish: Allow using netmapper with multiple requestctl ipblock types [puppet] - https://gerrit.wikimedia.org/r/784774 (https://phabricator.wikimedia.org/T305581) [23:11:07] (CR) RLazarus: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34929/console" [puppet] - https://gerrit.wikimedia.org/r/784774 (https://phabricator.wikimedia.org/T305581) (owner: RLazarus) [23:16:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T298565)', diff saved to https://phabricator.wikimedia.org/P25829 and previous config saved to 
/var/cache/conftool/dbconfig/20220420-231645-ladsgroup.json [23:16:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:16:51] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [23:17:13] PROBLEM - BGP status on cr1-drmrs is CRITICAL: BGP CRITICAL - AS6939/IPv6: Idle - HE, AS6939/IPv4: Idle - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [23:17:15] PROBLEM - Router interfaces on cr1-drmrs is CRITICAL: CRITICAL: host 185.15.58.128, interfaces up: 57, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:18:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P25830 and previous config saved to /var/cache/conftool/dbconfig/20220420-231808-ladsgroup.json [23:18:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:21:41] RECOVERY - BGP status on cr1-drmrs is OK: BGP OK - up: 15, down: 4, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [23:21:43] RECOVERY - Router interfaces on cr1-drmrs is OK: OK: host 185.15.58.128, interfaces up: 58, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:32:09] RECOVERY - Host mr1-eqiad.oob is UP: PING OK - Packet loss = 0%, RTA = 0.67 ms [23:32:35] RECOVERY - Host mr1-eqiad.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 2.22 ms [23:33:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P25831 and previous config saved to 
/var/cache/conftool/dbconfig/20220420-233313-ladsgroup.json [23:33:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:33:39] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [23:33:40] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [23:33:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:33:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:36:14] !log kubernetes/puppetmaster: added deployment/user tokens for new service image-suggestion T304891 [23:36:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:36:20] T304891: New Service Request Generated Datasets: Image Suggestions Service - https://phabricator.wikimedia.org/T304891 [23:40:19] (CR) CDanis: [C: +2] Proof of concept for haproxy statistics tracking [puppet] - https://gerrit.wikimedia.org/r/784309 (owner: CDanis) [23:44:56] (PS1) CDanis: haproxy stats: fix missing key [puppet] - https://gerrit.wikimedia.org/r/784787 [23:47:01] (CR) CDanis: [C: +2] haproxy stats: fix missing key [puppet] - https://gerrit.wikimedia.org/r/784787 (owner: CDanis) [23:48:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T298565)', diff saved to https://phabricator.wikimedia.org/P25832 and previous config saved to /var/cache/conftool/dbconfig/20220420-234818-ladsgroup.json [23:48:20] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1161.eqiad.wmnet with reason: Maintenance [23:48:22] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1161.eqiad.wmnet with reason: Maintenance [23:48:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:48:23] !log ladsgroup@cumin1001 START - 
Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [23:48:24] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [23:48:27] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [23:48:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:48:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:48:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1161 (T298565)', diff saved to https://phabricator.wikimedia.org/P25833 and previous config saved to /var/cache/conftool/dbconfig/20220420-234831-ladsgroup.json [23:48:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:48:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:48:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:50:47] PROBLEM - BGP status on cr1-drmrs is CRITICAL: BGP CRITICAL - AS6939/IPv4: Idle - HE, AS6939/IPv6: Idle - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [23:50:51] PROBLEM - Router interfaces on cr1-drmrs is CRITICAL: CRITICAL: host 185.15.58.128, interfaces up: 57, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:53:58] (KubernetesRsyslogDown) firing: rsyslog on kubestagemaster1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [23:57:31] RECOVERY - BGP status on cr1-drmrs 
is OK: BGP OK - up: 15, down: 4, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [23:57:37] RECOVERY - Router interfaces on cr1-drmrs is OK: OK: host 185.15.58.128, interfaces up: 58, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:58:58] (KubernetesRsyslogDown) firing: (2) rsyslog on kubestagemaster1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown