[00:08:50] PROBLEM - k8s API server requests latencies on ml-serve-ctrl1001 is CRITICAL: instance=10.64.16.202 verb=PUT https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [00:14:25] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [00:17:01] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1182.eqiad.wmnet with reason: Maintenance [00:17:03] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1182.eqiad.wmnet with reason: Maintenance [00:17:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:17:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1182 (T298565)', diff saved to https://phabricator.wikimedia.org/P23313 and previous config saved to /var/cache/conftool/dbconfig/20220328-001707-ladsgroup.json [00:17:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:17:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:17:14] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [00:21:13] (KubernetesRsyslogDown) firing: rsyslog on ml-serve1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [00:24:00] (JobUnavailable) firing: Reduced availability for job trafficserver in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [00:25:14] RECOVERY - k8s API server requests latencies on ml-serve-ctrl1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [00:40:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T298565)', diff saved to https://phabricator.wikimedia.org/P23314 and previous config saved to /var/cache/conftool/dbconfig/20220328-004027-ladsgroup.json [00:40:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:40:33] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [00:49:52] PROBLEM - SSH on db2090.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [00:53:40] PROBLEM - restbase endpoints health on restbase2025 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:55:28] PROBLEM - Varnish traffic drop between 30min ago and now at esams on alert1001 is CRITICAL: 59.59 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [00:55:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P23315 and previous config saved to /var/cache/conftool/dbconfig/20220328-005533-ladsgroup.json [00:55:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[00:55:38] RECOVERY - restbase endpoints health on restbase2025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:57:36] RECOVERY - Varnish traffic drop between 30min ago and now at esams on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [01:10:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P23316 and previous config saved to /var/cache/conftool/dbconfig/20220328-011038-ladsgroup.json [01:10:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:25:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T298565)', diff saved to https://phabricator.wikimedia.org/P23317 and previous config saved to /var/cache/conftool/dbconfig/20220328-012543-ladsgroup.json [01:25:46] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance [01:25:48] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance [01:25:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:25:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3312 (T298565)', diff saved to https://phabricator.wikimedia.org/P23318 and previous config saved to /var/cache/conftool/dbconfig/20220328-012553-ladsgroup.json [01:25:53] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [01:25:56] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:26:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:26:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:28:35] (CR) AntiCompositeNumber: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/773320 (https://phabricator.wikimedia.org/T303579) (owner: NguoiDungKhongDinhDanh) [01:38:46] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:43:45] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:51:00] RECOVERY - SSH on db2090.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:52:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T298565)', diff saved to https://phabricator.wikimedia.org/P23319 and previous config saved to /var/cache/conftool/dbconfig/20220328-015241-ladsgroup.json [01:52:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:52:48] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [02:06:55] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - 
https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [02:07:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P23320 and previous config saved to /var/cache/conftool/dbconfig/20220328-020746-ladsgroup.json [02:07:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:12:09] (CR) Ottomata: Add an alert for zero messages being generated by varnishkafka instances (3 comments) [alerts] - https://gerrit.wikimedia.org/r/773801 (https://phabricator.wikimedia.org/T300246) (owner: Btullis) [02:16:21] (CR) Ottomata: [C: +1] "looks like a small indent fix needed, but otherwise +1! :)" [mediawiki-config] - https://gerrit.wikimedia.org/r/773896 (https://phabricator.wikimedia.org/T304336) (owner: Sharvaniharan) [02:22:26] PROBLEM - k8s API server requests latencies on ml-serve-ctrl1001 is CRITICAL: instance=10.64.16.202 verb=PUT https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [02:22:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P23321 and previous config saved to /var/cache/conftool/dbconfig/20220328-022251-ladsgroup.json [02:22:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:26:54] RECOVERY - k8s API server requests latencies on ml-serve-ctrl1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [02:37:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T298565)', diff saved to https://phabricator.wikimedia.org/P23322 and previous config saved to /var/cache/conftool/dbconfig/20220328-023756-ladsgroup.json [02:37:58] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1129.eqiad.wmnet with reason: Maintenance [02:37:59] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1129.eqiad.wmnet with reason: Maintenance [02:38:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:38:02] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [02:38:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1129 (T298565)', diff saved to https://phabricator.wikimedia.org/P23323 and previous config saved to /var/cache/conftool/dbconfig/20220328-023804-ladsgroup.json [02:38:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:38:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:38:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:50:18] PROBLEM - k8s API server requests latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [03:18:02] PROBLEM - WDQS SPARQL on wdqs1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:26:58] PROBLEM - k8s API server 
requests latencies on ml-serve-ctrl1001 is CRITICAL: instance=10.64.16.202 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [03:31:10] RECOVERY - WDQS SPARQL on wdqs1005 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.067 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:38:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T298565)', diff saved to https://phabricator.wikimedia.org/P23324 and previous config saved to /var/cache/conftool/dbconfig/20220328-033818-ladsgroup.json [03:38:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:38:25] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [03:40:18] RECOVERY - k8s API server requests latencies on ml-serve-ctrl1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [03:48:04] RECOVERY - k8s API server requests latencies on ml-serve-ctrl1002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [03:53:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P23325 and previous config saved to /var/cache/conftool/dbconfig/20220328-035323-ladsgroup.json [03:53:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:08:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P23326 and previous config saved to /var/cache/conftool/dbconfig/20220328-040829-ladsgroup.json [04:08:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:12:42] ops-eqiad: Eqiad: asw2-a-eqiad:xe-2/0/40 interface up with no description - https://phabricator.wikimedia.org/T304807 (Papaul) [04:21:13] (KubernetesRsyslogDown) firing: rsyslog on ml-serve1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [04:23:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T298565)', diff saved to https://phabricator.wikimedia.org/P23327 and previous config saved to /var/cache/conftool/dbconfig/20220328-042334-ladsgroup.json [04:23:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:23:40] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [04:25:10] PROBLEM - Check systemd state on mw1308 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:54:48] PROBLEM - SSH on db2090.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:08:34] PROBLEM - SSH on kubernetes2002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:12:44] (PS1) Samwilson: [config] Enable Realtime Preview on Beta enwiki, enwikisource, and hewiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/774182 (https://phabricator.wikimedia.org/T303961) [05:28:10] PROBLEM - LVS zotero codfw port 4969/tcp - Zotero- zotero.svc.codfw.wmnet IPv4 #page on zotero.svc.codfw.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [05:29:28] (PS2) Samwilson: Enable Realtime Preview on Beta enwiki, enwikisource, and hewiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/774182 (https://phabricator.wikimedia.org/T303961) [05:30:18] RECOVERY - LVS zotero codfw port 4969/tcp - Zotero- zotero.svc.codfw.wmnet IPv4 #page on zotero.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 196 bytes in 1.147 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [05:32:09] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/zotero: sync [05:32:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:32:27] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/zotero: sync [05:32:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:35:48] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1139.eqiad.wmnet with reason: Maintenance [05:35:49] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1139.eqiad.wmnet with reason: Maintenance 
[05:35:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:35:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:37:42] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1150.eqiad.wmnet with reason: Maintenance [05:37:44] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1150.eqiad.wmnet with reason: Maintenance [05:37:45] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1145.eqiad.wmnet with reason: Maintenance [05:37:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:37:47] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1145.eqiad.wmnet with reason: Maintenance [05:37:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:37:52] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2110.codfw.wmnet with reason: Maintenance [05:37:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:37:53] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2110.codfw.wmnet with reason: Maintenance [05:37:54] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 16:00:00 on 12 hosts with reason: Maintenance [05:37:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:38:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:38:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:38:03] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on 12 hosts with reason: Maintenance [05:38:05] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1121.eqiad.wmnet with reason: Maintenance [05:38:06] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:38:07] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1121.eqiad.wmnet with reason: Maintenance [05:38:08] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 16:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [05:38:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:38:12] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [05:38:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:38:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:38:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1121 (T298556)', diff saved to https://phabricator.wikimedia.org/P23328 and previous config saved to /var/cache/conftool/dbconfig/20220328-053816-marostegui.json [05:38:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:38:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:38:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:38:26] T298556: Fix mismatching field type of oldimage.oi_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298556 [05:44:00] (JobUnavailable) firing: Reduced availability for job trafficserver in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:45:18] PROBLEM - restbase endpoints health on restbase2023 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [05:45:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1099 for downgrade', diff saved to https://phabricator.wikimedia.org/P23329 and previous config saved to /var/cache/conftool/dbconfig/20220328-054552-marostegui.json [05:45:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:47:28] RECOVERY - restbase endpoints health on restbase2023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [06:01:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1099:3311 (re)pooling @ 10%: After downgrade', diff saved to https://phabricator.wikimedia.org/P23330 and previous config saved to /var/cache/conftool/dbconfig/20220328-060123-root.json [06:01:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:01:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1099:3318 (re)pooling @ 10%: After downgrade', diff saved to https://phabricator.wikimedia.org/P23331 and previous config saved to /var/cache/conftool/dbconfig/20220328-060138-root.json [06:01:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:02:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1166 for downgrade', diff saved to https://phabricator.wikimedia.org/P23332 and previous config saved to /var/cache/conftool/dbconfig/20220328-060239-marostegui.json [06:02:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:05:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121 (T298556)', diff saved to https://phabricator.wikimedia.org/P23333 and previous config saved to /var/cache/conftool/dbconfig/20220328-060525-marostegui.json [06:05:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:05:30] T298556: Fix mismatching field type of oldimage.oi_timestamp on wmf wikis - 
https://phabricator.wikimedia.org/T298556 [06:05:34] PROBLEM - k8s API server requests latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [06:06:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 10%: After schema downgrade ', diff saved to https://phabricator.wikimedia.org/P23334 and previous config saved to /var/cache/conftool/dbconfig/20220328-060645-root.json [06:06:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:06:55] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [06:09:38] RECOVERY - SSH on kubernetes2002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:16:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1099:3311 (re)pooling @ 25%: After downgrade', diff saved to https://phabricator.wikimedia.org/P23335 and previous config saved to /var/cache/conftool/dbconfig/20220328-061627-root.json [06:16:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:16:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1099:3318 (re)pooling @ 25%: After downgrade', diff saved to https://phabricator.wikimedia.org/P23336 and previous config saved to /var/cache/conftool/dbconfig/20220328-061642-root.json [06:16:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:20:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121', diff saved to https://phabricator.wikimedia.org/P23337 and previous config saved to 
/var/cache/conftool/dbconfig/20220328-062030-marostegui.json [06:20:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:21:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 25%: After schema downgrade ', diff saved to https://phabricator.wikimedia.org/P23338 and previous config saved to /var/cache/conftool/dbconfig/20220328-062149-root.json [06:21:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:22:40] (PS4) Elukey: Add helmfile config for Istio proxy sidecars [deployment-charts] - https://gerrit.wikimedia.org/r/773565 (https://phabricator.wikimedia.org/T297612) [06:23:43] (CR) Elukey: Add helmfile config for Istio proxy sidecars (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/773565 (https://phabricator.wikimedia.org/T297612) (owner: Elukey) [06:31:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1099:3311 (re)pooling @ 50%: After downgrade', diff saved to https://phabricator.wikimedia.org/P23339 and previous config saved to /var/cache/conftool/dbconfig/20220328-063131-root.json [06:31:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:31:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1099:3318 (re)pooling @ 50%: After downgrade', diff saved to https://phabricator.wikimedia.org/P23340 and previous config saved to /var/cache/conftool/dbconfig/20220328-063146-root.json [06:31:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:35:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121', diff saved to https://phabricator.wikimedia.org/P23341 and previous config saved to /var/cache/conftool/dbconfig/20220328-063535-marostegui.json [06:35:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:36:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 50%: After schema downgrade ', diff saved 
to https://phabricator.wikimedia.org/P23342 and previous config saved to /var/cache/conftool/dbconfig/20220328-063652-root.json [06:36:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:38:49] (PS1) Filippo Giunchedi: logging: bump alerts logs retention [puppet] - https://gerrit.wikimedia.org/r/774364 [06:38:58] RECOVERY - k8s API server requests latencies on ml-serve-ctrl1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [06:39:24] PROBLEM - SSH on ml-serve-ctrl1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [06:41:44] (PS31) Elukey: Refactor Calico's CNI plugin config [puppet] - https://gerrit.wikimedia.org/r/772909 (https://phabricator.wikimedia.org/T297612) [06:42:58] (KubernetesCalicoDown) firing: ml-serve-ctrl1002.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [06:43:41] (PS32) Elukey: Refactor Calico's CNI plugin config [puppet] - https://gerrit.wikimedia.org/r/772909 (https://phabricator.wikimedia.org/T297612) [06:43:45] (JobUnavailable) firing: (2) Reduced availability for job k8s-api in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:43:46] (PS1) Marostegui: db1175: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/774366 (https://phabricator.wikimedia.org/T304810) [06:44:57] (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34573/console" [puppet] - https://gerrit.wikimedia.org/r/772909 (https://phabricator.wikimedia.org/T297612) (owner: Elukey) 
[06:45:00] (CR) Marostegui: [C: +2] db1175: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/774366 (https://phabricator.wikimedia.org/T304810) (owner: Marostegui) [06:46:28] (PS4) Elukey: WIP - Experiment with istio-cni plugin configs [puppet] - https://gerrit.wikimedia.org/r/773185 [06:46:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1099:3311 (re)pooling @ 75%: After downgrade', diff saved to https://phabricator.wikimedia.org/P23343 and previous config saved to /var/cache/conftool/dbconfig/20220328-064635-root.json [06:46:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:46:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1099:3318 (re)pooling @ 75%: After downgrade', diff saved to https://phabricator.wikimedia.org/P23344 and previous config saved to /var/cache/conftool/dbconfig/20220328-064650-root.json [06:46:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:47:32] (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34574/console" [puppet] - https://gerrit.wikimedia.org/r/773185 (owner: Elukey) [06:47:58] (PS33) Elukey: Refactor Calico's CNI plugin config [puppet] - https://gerrit.wikimedia.org/r/772909 (https://phabricator.wikimedia.org/T297612) [06:48:08] (PS5) Elukey: WIP - Experiment with istio-cni plugin configs [puppet] - https://gerrit.wikimedia.org/r/773185 [06:49:16] (CR) Elukey: "After a chat with Joe I tried to take a simpler approach, less things packed into the cni define and more things composed in the calico pr" [puppet] - https://gerrit.wikimedia.org/r/772909 (https://phabricator.wikimedia.org/T297612) (owner: Elukey) [06:50:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121 (T298556)', diff saved to https://phabricator.wikimedia.org/P23345 and previous config saved to 
/var/cache/conftool/dbconfig/20220328-065040-marostegui.json [06:50:42] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1141.eqiad.wmnet with reason: Maintenance [06:50:44] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1141.eqiad.wmnet with reason: Maintenance [06:50:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:50:48] T298556: Fix mismatching field type of oldimage.oi_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298556 [06:50:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1141 (T298556)', diff saved to https://phabricator.wikimedia.org/P23346 and previous config saved to /var/cache/conftool/dbconfig/20220328-065048-marostegui.json [06:50:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:50:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:50:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:51:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 75%: After schema downgrade ', diff saved to https://phabricator.wikimedia.org/P23347 and previous config saved to /var/cache/conftool/dbconfig/20220328-065156-root.json [06:52:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:52:21] !log reboot ml-serve-ctrl1002 - ganeti console available but slow (attempted to root login but never get to input the password) [06:52:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:54:54] RECOVERY - SSH on ml-serve-ctrl1002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [06:55:58] (KubernetesRsyslogDown) firing: (2) rsyslog on ml-serve1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [06:57:14] RECOVERY - SSH on db2090.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:57:58] (KubernetesCalicoDown) resolved: ml-serve-ctrl1002.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [06:58:45] (JobUnavailable) firing: (2) Reduced availability for job k8s-api in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:00:05] Amir1, awight, Urbanecm, and taavi: Your horoscope predicts another unfortunate UTC morning backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220328T0700). [07:00:05] samwilson: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:27] Hello. I'm here. 
:) [07:00:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1175 for downgrade', diff saved to https://phabricator.wikimedia.org/P23348 and previous config saved to /var/cache/conftool/dbconfig/20220328-070056-marostegui.json [07:01:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:01:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1099:3311 (re)pooling @ 100%: After downgrade', diff saved to https://phabricator.wikimedia.org/P23349 and previous config saved to /var/cache/conftool/dbconfig/20220328-070139-root.json [07:01:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:01:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1099:3318 (re)pooling @ 100%: After downgrade', diff saved to https://phabricator.wikimedia.org/P23350 and previous config saved to /var/cache/conftool/dbconfig/20220328-070154-root.json [07:01:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:02:35] (PS1) Marostegui: Revert "db1175: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/773984 [07:05:34] PROBLEM - Check systemd state on phab1001 is CRITICAL: CRITICAL - degraded: The following units failed: phabricator_clean_tmp_files.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:05:58] (KubernetesRsyslogDown) firing: (2) rsyslog on ml-serve1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [07:06:06] (CR) Marostegui: [C: +2] Revert "db1175: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/773984 (owner: Marostegui) [07:07:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 100%: After schema downgrade ', diff saved to https://phabricator.wikimedia.org/P23351 and previous config saved to /var/cache/conftool/dbconfig/20220328-070700-root.json [07:07:05] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:08:18] Amir1, awight, Urbanecm, or taavi: are you deploying today? [07:08:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1175 (re)pooling @ 10%: After downgrade', diff saved to https://phabricator.wikimedia.org/P23352 and previous config saved to /var/cache/conftool/dbconfig/20220328-070825-root.json [07:08:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:09:15] SRE, Infrastructure-Foundations: Integrate Bullseye 11.3 point update - https://phabricator.wikimedia.org/T304599 (MoritzMuehlenhoff) [07:10:18] SRE, Infrastructure-Foundations: Integrate Bullseye 11.3 point update - https://phabricator.wikimedia.org/T304599 (MoritzMuehlenhoff) [07:12:59] !log updated d-i images for Bullseye 11.3 release T304599 [07:13:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:13:04] T304599: Integrate Bullseye 11.3 point update - https://phabricator.wikimedia.org/T304599 [07:13:20] SRE, Infrastructure-Foundations: Integrate Bullseye 11.3 point update - https://phabricator.wikimedia.org/T304599 (MoritzMuehlenhoff) [07:14:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1096:3316 for schema change', diff saved to https://phabricator.wikimedia.org/P23353 and previous config saved to /var/cache/conftool/dbconfig/20220328-071427-marostegui.json [07:14:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:17:01] (CR) NguoiDungKhongDinhDanh: "This change is ready for review." 
[mediawiki-config] - https://gerrit.wikimedia.org/r/774386 (owner: NguoiDungKhongDinhDanh) [07:18:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141 (T298556)', diff saved to https://phabricator.wikimedia.org/P23354 and previous config saved to /var/cache/conftool/dbconfig/20220328-071846-marostegui.json [07:18:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:18:52] T298556: Fix mismatching field type of oldimage.oi_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298556 [07:23:21] (PS3) NguoiDungKhongDinhDanh: Fix Id1fa4d6b02155c940c2b40b1c5411d5479dc7d2b: Add viwiki eliminators to wgContentTranslationPublishRequirements. [mediawiki-config] - https://gerrit.wikimedia.org/r/774386 (https://phabricator.wikimedia.org/T299636) [07:23:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1175 (re)pooling @ 25%: After downgrade', diff saved to https://phabricator.wikimedia.org/P23355 and previous config saved to /var/cache/conftool/dbconfig/20220328-072329-root.json [07:23:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:25:03] samwilson: sorry I'm late, I can deploy today [07:25:18] * taavi waits for the bridge to fix itself [07:25:41] taavi: no worries! [07:25:50] thanks [07:26:05] how did you see my message while not here? I thought the bridge was not supposed to do that [07:26:39] hmm I'm not sure! 
it all looks fine from here [07:27:09] (CR) Majavah: [C: +2] Enable Realtime Preview on Beta enwiki, enwikisource, and hewiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/774182 (https://phabricator.wikimedia.org/T303961) (owner: Samwilson) [07:27:51] (Merged) jenkins-bot: Enable Realtime Preview on Beta enwiki, enwikisource, and hewiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/774182 (https://phabricator.wikimedia.org/T303961) (owner: Samwilson) [07:28:56] samwilson: just merged your patch, it should auto deploy to beta within the next 30 mins or so, feel free to ping me if it does not [07:29:19] taavi: great, thanks! [07:29:21] Krinkle: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/769757 was never pulled to deploy1002, does it need to be synced? [07:30:25] (PS2) Majavah: Remove unused CentralAuth settings [mediawiki-config] - https://gerrit.wikimedia.org/r/773441 [07:30:43] (CR) Majavah: [C: +2] Remove unused CentralAuth settings [mediawiki-config] - https://gerrit.wikimedia.org/r/773441 (owner: Majavah) [07:31:31] (Merged) jenkins-bot: Remove unused CentralAuth settings [mediawiki-config] - https://gerrit.wikimedia.org/r/773441 (owner: Majavah) [07:33:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P23356 and previous config saved to /var/cache/conftool/dbconfig/20220328-073351-marostegui.json [07:33:52] !log taavi@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:773441|Remove unused CentralAuth settings]] (1/2) (duration: 00m 56s) [07:33:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:33:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:34:24] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [07:34:28] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:34:57] !log taavi@deploy1002 Synchronized wmf-config/CommonSettings.php: Config: [[gerrit:773441|Remove unused CentralAuth settings]] (2/2) (duration: 00m 55s) [07:35:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:35:21] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [07:35:22] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [07:35:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:35:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:36:11] (PS2) Filippo Giunchedi: sre: add ProbeDown paging alert for enabled services [alerts] - https://gerrit.wikimedia.org/r/773747 (https://phabricator.wikimedia.org/T291946) [07:36:24] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [07:36:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:36:35] (CR) Filippo Giunchedi: sre: add ProbeDown paging alert for enabled services (1 comment) [alerts] - https://gerrit.wikimedia.org/r/773747 (https://phabricator.wikimedia.org/T291946) (owner: Filippo Giunchedi) [07:38:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1175 (re)pooling @ 50%: After downgrade', diff saved to https://phabricator.wikimedia.org/P23357 and previous config saved to /var/cache/conftool/dbconfig/20220328-073833-root.json [07:38:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:39:22] !log updated d-i images for Buster 10.12 release T304546 [07:39:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:39:26] T304546: Integrate Buster 10.12 point update - https://phabricator.wikimedia.org/T304546 [07:40:12] SRE, Infrastructure-Foundations: Integrate Buster 10.12 point update - https://phabricator.wikimedia.org/T304546 
(MoritzMuehlenhoff) [07:45:14] (CR) Filippo Giunchedi: profile: issue warnings for check_mw_versions (1 comment) [puppet] - https://gerrit.wikimedia.org/r/767729 (https://phabricator.wikimedia.org/T302832) (owner: Filippo Giunchedi) [07:46:29] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [07:46:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:48:40] (CR) Ayounsi: Modify wmf-netbox plugin to provide QFX5120-48Y port block speeds (1 comment) [software/homer/deploy] - https://gerrit.wikimedia.org/r/769729 (https://phabricator.wikimedia.org/T303529) (owner: Cathal Mooney) [07:48:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P23358 and previous config saved to /var/cache/conftool/dbconfig/20220328-074856-marostegui.json [07:49:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:49:54] (PS1) Marostegui: db1106,db1154: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/774371 (https://phabricator.wikimedia.org/T304812) [07:50:28] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [07:50:29] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [07:50:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:50:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:50:47] (CR) Marostegui: [C: +2] db1106,db1154: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/774371 (https://phabricator.wikimedia.org/T304812) (owner: Marostegui) [07:51:08] !log dbmaint s1@codfw T304812 [07:51:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:51:13] T304812: Rebuild logging table on s1 hosts - https://phabricator.wikimedia.org/T304812 [07:53:37] !log marostegui@cumin1001 
dbctl commit (dc=all): 'db1175 (re)pooling @ 75%: After downgrade', diff saved to https://phabricator.wikimedia.org/P23359 and previous config saved to /var/cache/conftool/dbconfig/20220328-075337-root.json [07:53:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:54:11] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [07:54:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:54:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 10%: After schema change', diff saved to https://phabricator.wikimedia.org/P23360 and previous config saved to /var/cache/conftool/dbconfig/20220328-075451-root.json [07:54:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:59:09] (CR) Filippo Giunchedi: Add an alert for zero messages being generated by varnishkafka instances (3 comments) [alerts] - https://gerrit.wikimedia.org/r/773801 (https://phabricator.wikimedia.org/T300246) (owner: Btullis) [07:59:40] SRE, ops-eqiad: eqiad: Move links to new MPC7E linecard - https://phabricator.wikimedia.org/T304712 (ayounsi) Checking LLDP it looks all good to me. I'd have preferred that the links are not crossed between FPC and CR, for example that all the links on FPC2 go to cr1, but now is not a good time to change... 
[07:59:43] (PS1) Marostegui: Revert "db1106,db1154: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/774388 [08:03:25] (CR) Marostegui: [C: +2] Revert "db1106,db1154: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/774388 (owner: Marostegui) [08:03:49] SRE, observability, Patch-For-Review, SRE Observability (FY2021/2022-Q3), Sustainability (Incident Followup): Most Icinga http checks ignore the URL parameter - https://phabricator.wikimedia.org/T304321 (fgiunchedi) [08:04:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141 (T298556)', diff saved to https://phabricator.wikimedia.org/P23361 and previous config saved to /var/cache/conftool/dbconfig/20220328-080401-marostegui.json [08:04:03] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1142.eqiad.wmnet with reason: Maintenance [08:04:05] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1142.eqiad.wmnet with reason: Maintenance [08:04:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:04:07] T298556: Fix mismatching field type of oldimage.oi_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298556 [08:04:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:04:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1142 (T298556)', diff saved to https://phabricator.wikimedia.org/P23362 and previous config saved to /var/cache/conftool/dbconfig/20220328-080409-marostegui.json [08:04:10] SRE, observability, Patch-For-Review, SRE Observability (FY2021/2022-Q3), Sustainability (Incident Followup): Most Icinga http checks ignore the URL parameter - https://phabricator.wikimedia.org/T304321 (fgiunchedi) p:High→Medium The root cause is fixed AFAICS and @jbond has added bac... 
[08:04:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:04:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:04:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1112 (re)pooling @ 10%: After downgrade ', diff saved to https://phabricator.wikimedia.org/P23363 and previous config saved to /var/cache/conftool/dbconfig/20220328-080429-root.json [08:04:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:05:33] (PS1) Ayounsi: Papaul operations -> super-user [homer/public] - https://gerrit.wikimedia.org/r/774372 [08:07:46] (PS1) Marostegui: mariadb: Promote db1123 to s3 master [puppet] - https://gerrit.wikimedia.org/r/774373 (https://phabricator.wikimedia.org/T301850) [08:08:34] (PS1) Marostegui: wmnet: Update s3 master CNAME [dns] - https://gerrit.wikimedia.org/r/774374 (https://phabricator.wikimedia.org/T301850) [08:08:39] (CR) Marostegui: [C: -2] "Wait for the failover date" [puppet] - https://gerrit.wikimedia.org/r/774373 (https://phabricator.wikimedia.org/T301850) (owner: Marostegui) [08:08:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1175 (re)pooling @ 100%: After downgrade', diff saved to https://phabricator.wikimedia.org/P23364 and previous config saved to /var/cache/conftool/dbconfig/20220328-080841-root.json [08:08:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:08:59] (CR) Marostegui: [C: -2] "Wait for the failover date" [dns] - https://gerrit.wikimedia.org/r/774374 (https://phabricator.wikimedia.org/T301850) (owner: Marostegui) [08:09:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 25%: After schema change', diff saved to https://phabricator.wikimedia.org/P23365 and previous config saved to /var/cache/conftool/dbconfig/20220328-080955-root.json [08:09:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:10:06] 
SRE, SRE-swift-storage, Data-Persistence-Backup, Patch-For-Review: Create a read-only swift identity for backup taking - https://phabricator.wikimedia.org/T269108 (MatthewVernon) Is this urgent? I have a number of things I'm trying to land before the end of the quarter... [08:12:00] (CR) Ayounsi: [C: +2] Papaul operations -> super-user [homer/public] - https://gerrit.wikimedia.org/r/774372 (owner: Ayounsi) [08:12:12] (CR) Ladsgroup: [C: +1] mariadb: Promote db1123 to s3 master [puppet] - https://gerrit.wikimedia.org/r/774373 (https://phabricator.wikimedia.org/T301850) (owner: Marostegui) [08:12:23] (CR) Ladsgroup: [C: +1] wmnet: Update s3 master CNAME [dns] - https://gerrit.wikimedia.org/r/774374 (https://phabricator.wikimedia.org/T301850) (owner: Marostegui) [08:13:55] (Merged) jenkins-bot: Papaul operations -> super-user [homer/public] - https://gerrit.wikimedia.org/r/774372 (owner: Ayounsi) [08:18:07] (PS2) David Caro: discovery: remove unneeded protected-access supression [cookbooks] - https://gerrit.wikimedia.org/r/773744 [08:19:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1112 (re)pooling @ 25%: After downgrade ', diff saved to https://phabricator.wikimedia.org/P23366 and previous config saved to /var/cache/conftool/dbconfig/20220328-081933-root.json [08:19:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:21:10] (CR) jerkins-bot: [V: -1] discovery: remove unneeded protected-access supression [cookbooks] - https://gerrit.wikimedia.org/r/773744 (owner: David Caro) [08:24:16] (CR) JMeybohm: Add helmfile config for Istio proxy sidecars (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/773565 (https://phabricator.wikimedia.org/T297612) (owner: Elukey) [08:25:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 50%: After schema change', diff saved to https://phabricator.wikimedia.org/P23367 and previous 
config saved to /var/cache/conftool/dbconfig/20220328-082459-root.json [08:25:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:25:09] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:25:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142 (T298556)', diff saved to https://phabricator.wikimedia.org/P23368 and previous config saved to /var/cache/conftool/dbconfig/20220328-082518-marostegui.json [08:25:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:25:23] T298556: Fix mismatching field type of oldimage.oi_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298556 [08:26:11] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 241, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:26:39] (CR) JMeybohm: [V: +1] "PCC SUCCESS (DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34575/console" [puppet] - https://gerrit.wikimedia.org/r/772909 (https://phabricator.wikimedia.org/T297612) (owner: Elukey) [08:27:23] (PS1) Filippo Giunchedi: icinga: quote check_http_url_for_regexp_on_port regex argument [puppet] - https://gerrit.wikimedia.org/r/774377 (https://phabricator.wikimedia.org/T304323) [08:27:39] (PS1) Jcrespo: mediabackup: Update backup of testwiki media on codfw [puppet] - https://gerrit.wikimedia.org/r/774378 (https://phabricator.wikimedia.org/T299764) [08:30:18] (PS4) KartikMistry: Add viwiki eliminators to wgContentTranslationPublishRequirements [mediawiki-config] - https://gerrit.wikimedia.org/r/774386 (https://phabricator.wikimedia.org/T299636) (owner: NguoiDungKhongDinhDanh) [08:30:35] 
(PS5) KartikMistry: Add viwiki eliminators to wgContentTranslationPublishRequirements [mediawiki-config] - https://gerrit.wikimedia.org/r/774386 (https://phabricator.wikimedia.org/T299636) (owner: NguoiDungKhongDinhDanh) [08:32:43] (CR) KartikMistry: [C: +1] "Patch looks good. I'll schedule for its deployment." [mediawiki-config] - https://gerrit.wikimedia.org/r/774386 (https://phabricator.wikimedia.org/T299636) (owner: NguoiDungKhongDinhDanh) [08:34:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1112 (re)pooling @ 50%: After downgrade ', diff saved to https://phabricator.wikimedia.org/P23369 and previous config saved to /var/cache/conftool/dbconfig/20220328-083437-root.json [08:34:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:35:03] (CR) JMeybohm: [V: +1 C: +1] Refactor Calico's CNI plugin config [puppet] - https://gerrit.wikimedia.org/r/772909 (https://phabricator.wikimedia.org/T297612) (owner: Elukey) [08:35:26] (CR) Jelto: [V: +1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34576/console" [puppet] - https://gerrit.wikimedia.org/r/773783 (https://phabricator.wikimedia.org/T274463) (owner: Jelto) [08:37:26] jouncebot: nowandnext [08:37:26] No deployments scheduled for the next 4 hour(s) and 22 minute(s) [08:37:26] In 4 hour(s) and 22 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220328T1300) [08:37:27] (CR) Elukey: Add helmfile config for Istio proxy sidecars (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/773565 (https://phabricator.wikimedia.org/T297612) (owner: Elukey) [08:37:32] noice [08:38:21] (CR) Jelto: [V: +1] "@dzahn I'm confused by the PCC output NOOP. I would expect a change for gitlab-restore.sh script. 
Is this a PCC issue or is something else" [puppet] - https://gerrit.wikimedia.org/r/773783 (https://phabricator.wikimedia.org/T274463) (owner: Jelto) [08:40:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 75%: After schema change', diff saved to https://phabricator.wikimedia.org/P23370 and previous config saved to /var/cache/conftool/dbconfig/20220328-084003-root.json [08:40:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142', diff saved to https://phabricator.wikimedia.org/P23371 and previous config saved to /var/cache/conftool/dbconfig/20220328-084023-marostegui.json [08:40:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:25] (PS2) Ladsgroup: Enable videojs in the second batch of wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/773938 (https://phabricator.wikimedia.org/T248418) [08:41:37] (CR) Ladsgroup: [C: +2] Enable videojs in the second batch of wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/773938 (https://phabricator.wikimedia.org/T248418) (owner: Ladsgroup) [08:42:24] (Merged) jenkins-bot: Enable videojs in the second batch of wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/773938 (https://phabricator.wikimedia.org/T248418) (owner: Ladsgroup) [08:43:35] !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:773938|Enable videojs in the second batch of wikis (T248418)]] (duration: 00m 55s) [08:43:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:43:41] T248418: Roll out videojs as the only video/audio player on all Wikimedia wikis - https://phabricator.wikimedia.org/T248418 [08:43:59] (PS2) Ladsgroup: Enable WRITE BOTH for templatelinks normalization in more wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/773594 
(https://phabricator.wikimedia.org/T299421) [08:44:03] (CR) Ladsgroup: [C: +2] Enable WRITE BOTH for templatelinks normalization in more wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/773594 (https://phabricator.wikimedia.org/T299421) (owner: Ladsgroup) [08:44:45] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [08:44:46] (Merged) jenkins-bot: Enable WRITE BOTH for templatelinks normalization in more wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/773594 (https://phabricator.wikimedia.org/T299421) (owner: Ladsgroup) [08:44:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:45:39] marostegui: heads up, I'm deploying write both on the new templatelinks columns in one wiki in sections that's done to make sure replication is not breaking [08:45:40] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [08:45:41] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [08:45:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:45:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:45:50] Amir1: oki [08:46:05] <_joe_> !log uploading conftool 2.0.0, T302471 [08:46:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:46:38] (PS1) Phedenskog: grafana: provision JSON datasource [puppet] - https://gerrit.wikimedia.org/r/774380 (https://phabricator.wikimedia.org/T304583) [08:46:52] !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:773594|Enable WRITE BOTH for templatelinks normalization in more wikis (T299421)]] (duration: 00m 54s) [08:46:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:46:56] T299421: Turn on write both in production for templatelinks normalization - https://phabricator.wikimedia.org/T299421 [08:46:58] (PS1) 
Majavah: paws: add paws prometheus role/profile [puppet] - https://gerrit.wikimedia.org/r/774381 (https://phabricator.wikimedia.org/T304716) [08:47:00] (PS1) Majavah: paws: add haproxy routing for prometheus [puppet] - https://gerrit.wikimedia.org/r/774382 (https://phabricator.wikimedia.org/T304716) [08:47:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1169 T304812', diff saved to https://phabricator.wikimedia.org/P23373 and previous config saved to /var/cache/conftool/dbconfig/20220328-084705-marostegui.json [08:47:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:47:10] T304812: Rebuild logging table on s1 hosts - https://phabricator.wikimedia.org/T304812 [08:47:45] !log dbmaint s1@eqiad T304812 [08:47:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:47:55] (CR) Phedenskog: "Is that what it should look like?" [puppet] - https://gerrit.wikimedia.org/r/774380 (https://phabricator.wikimedia.org/T304583) (owner: Phedenskog) [08:47:59] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [08:48:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1112 (re)pooling @ 75%: After downgrade ', diff saved to https://phabricator.wikimedia.org/P23374 and previous config saved to /var/cache/conftool/dbconfig/20220328-084941-root.json [08:49:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:12] !log deploy new alerting (0.7.1) for db backups at alert1001 T138562 [08:50:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:16] T138562: Improve regular production database backups handling - https://phabricator.wikimedia.org/T138562 [08:53:02] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [08:53:05] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:55:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 100%: After schema change', diff saved to https://phabricator.wikimedia.org/P23375 and previous config saved to /var/cache/conftool/dbconfig/20220328-085507-root.json [08:55:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:55:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142', diff saved to https://phabricator.wikimedia.org/P23376 and previous config saved to /var/cache/conftool/dbconfig/20220328-085528-marostegui.json [08:55:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:55:36] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [08:55:38] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [08:55:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:55:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:55:52] SRE, Cloud-Services, Datasets-General-or-Unknown, affects-Kiwix-and-openZIM: Mirror more Kiwix downloads directories - https://phabricator.wikimedia.org/T57503 (Kelson) @ArielGlenn Hi, I already come back to you! T299993 is hidden to me and I have no visibility on it. Is that already implemented.... [08:56:36] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [08:56:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:17] SRE, Cloud-Services, Datasets-General-or-Unknown, affects-Kiwix-and-openZIM: Mirror more Kiwix downloads directories - https://phabricator.wikimedia.org/T57503 (ArielGlenn) >>! In T57503#7809873, @Kelson wrote: > @ArielGlenn Hi, I already come back to you! T299993 is hidden to me and I have no vi... 
[09:00:58] (PS1) Arturo Borrero Gonzalez: wmcs: toolforge: grid: add a cookbook to remove queue errors [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/774385 [09:02:23] SRE, Cloud-Services, Datasets-General-or-Unknown, affects-Kiwix-and-openZIM: Mirror more Kiwix downloads directories - https://phabricator.wikimedia.org/T57503 (Kelson) @ArielGlenn Sorry, I meant T286588 [09:03:58] !log installing Linux 5.10.106 on Bullseye hosts [09:04:00] (PS3) MMandere: site: Reimage cp2033 as cache::text_haproxy [puppet] - https://gerrit.wikimedia.org/r/773250 (https://phabricator.wikimedia.org/T290005) [09:04:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1112 (re)pooling @ 100%: After downgrade ', diff saved to https://phabricator.wikimedia.org/P23377 and previous config saved to /var/cache/conftool/dbconfig/20220328-090445-root.json [09:04:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:05:15] (PS1) Jcrespo: check: Fix typo causing x1 section to be unrecognized [software/wmfbackups] - https://gerrit.wikimedia.org/r/774406 (https://phabricator.wikimedia.org/T138562) [09:06:12] (PS2) Jcrespo: check: Fix typo causing x1 section to be unrecognized [software/wmfbackups] - https://gerrit.wikimedia.org/r/774406 (https://phabricator.wikimedia.org/T138562) [09:06:50] SRE, Cloud-Services, Datasets-General-or-Unknown, affects-Kiwix-and-openZIM: Mirror more Kiwix downloads directories - https://phabricator.wikimedia.org/T57503 (ArielGlenn) >>! In T57503#7809889, @Kelson wrote: > @ArielGlenn Sorry, I meant T286588 You can follow along with the install at T302981 [09:09:40] (CR) Filippo Giunchedi: [C: +1] "LGTM! 
Note that for full deployment we have to update the package in production; you can run tests on https://grafana-next.wikimedia.org/ " [puppet] - 10https://gerrit.wikimedia.org/r/774380 (https://phabricator.wikimedia.org/T304583) (owner: 10Phedenskog) [09:10:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142 (T298556)', diff saved to https://phabricator.wikimedia.org/P23378 and previous config saved to /var/cache/conftool/dbconfig/20220328-091033-marostegui.json [09:10:35] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1143.eqiad.wmnet with reason: Maintenance [09:10:36] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1143.eqiad.wmnet with reason: Maintenance [09:10:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:39] T298556: Fix mismatching field type of oldimage.oi_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298556 [09:10:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1143 (T298556)', diff saved to https://phabricator.wikimedia.org/P23379 and previous config saved to /var/cache/conftool/dbconfig/20220328-091041-marostegui.json [09:10:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:01] !log depool cp2033 for reimage - T290005 [09:11:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:05] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [09:12:04] (03CR) 10JMeybohm: [C: 03+1] Add helmfile config for Istio proxy sidecars (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/773565 (https://phabricator.wikimedia.org/T297612) (owner: 10Elukey) [09:13:02] (03CR) 
10Jcrespo: [C: 03+2] check: Fix typo causing x1 section to be unrecognized [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/774406 (https://phabricator.wikimedia.org/T138562) (owner: 10Jcrespo) [09:13:44] !log installing Linux 4.19.235 on Buster hosts [09:13:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:59] (03CR) 10MMandere: [C: 03+2] site: Reimage cp2033 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/773250 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere) [09:14:20] (03PS1) 10Filippo Giunchedi: icinga: remove double quoting for gerrit health check [puppet] - 10https://gerrit.wikimedia.org/r/774407 (https://phabricator.wikimedia.org/T304323) [09:14:22] (03PS1) 10Filippo Giunchedi: lists: remove double quoting for http check [puppet] - 10https://gerrit.wikimedia.org/r/774408 (https://phabricator.wikimedia.org/T304323) [09:17:21] (03PS23) 10MVernon: swift: deploy swift_ring_manager to one node per cluster [puppet] - 10https://gerrit.wikimedia.org/r/769941 (https://phabricator.wikimedia.org/T265117) [09:18:22] (03PS12) 10MVernon: puppetmaster: rsync swift rings from each cluster's ring manager [puppet] - 10https://gerrit.wikimedia.org/r/769942 (https://phabricator.wikimedia.org/T265117) [09:18:49] (03CR) 10jerkins-bot: [V: 04-1] swift: deploy swift_ring_manager to one node per cluster [puppet] - 10https://gerrit.wikimedia.org/r/769941 (https://phabricator.wikimedia.org/T265117) (owner: 10MVernon) [09:19:00] (03CR) 10MVernon: "I've taken all the straightforward changes; I'll come back to the others in due course (I'll make a phab task so as not to lose them!)" [puppet] - 10https://gerrit.wikimedia.org/r/769941 (https://phabricator.wikimedia.org/T265117) (owner: 10MVernon) [09:20:29] (03PS24) 10MVernon: swift: deploy swift_ring_manager to one node per cluster [puppet] - 10https://gerrit.wikimedia.org/r/769941 (https://phabricator.wikimedia.org/T265117) [09:21:46] (03CR) 10MVernon: "check 
experimental" [puppet] - 10https://gerrit.wikimedia.org/r/769942 (https://phabricator.wikimedia.org/T265117) (owner: 10MVernon) [09:22:14] (03CR) 10MVernon: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/769941 (https://phabricator.wikimedia.org/T265117) (owner: 10MVernon) [09:22:34] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/pcc-worker1003/34577/alert1001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/774377 (https://phabricator.wikimedia.org/T304323) (owner: 10Filippo Giunchedi) [09:24:34] !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host cp2033.codfw.wmnet with OS buster [09:24:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:43] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp2033.codfw.wmnet with OS buster [09:30:40] (03PS2) 10Arturo Borrero Gonzalez: wmcs: toolforge: grid: add a cookbook to remove queue errors [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/774385 [09:31:34] (03PS1) 10Jelto: gitlab: move systemd interval for backup and restore to hiera [puppet] - 10https://gerrit.wikimedia.org/r/774416 (https://phabricator.wikimedia.org/T274463) [09:31:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143 (T298556)', diff saved to https://phabricator.wikimedia.org/P23382 and previous config saved to /var/cache/conftool/dbconfig/20220328-093148-marostegui.json [09:31:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:55] T298556: Fix mismatching field type of oldimage.oi_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298556 [09:32:55] (03PS3) 10Arturo Borrero Gonzalez: wmcs: toolforge: grid: add a cookbook to remove queue errors 
[cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/774385 [09:34:08] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34579/console" [puppet] - 10https://gerrit.wikimedia.org/r/774416 (https://phabricator.wikimedia.org/T274463) (owner: 10Jelto) [09:35:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1169 (re)pooling @ 10%: After schema change', diff saved to https://phabricator.wikimedia.org/P23383 and previous config saved to /var/cache/conftool/dbconfig/20220328-093503-root.json [09:35:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:43] (03PS2) 10Jelto: gitlab: move systemd interval for backup and restore to hiera [puppet] - 10https://gerrit.wikimedia.org/r/774416 (https://phabricator.wikimedia.org/T274463) [09:38:37] (03CR) 10Muehlenhoff: mediabackup::storage: Switch to systemd::sysuser (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/771560 (owner: 10Muehlenhoff) [09:39:06] (03CR) 10David Caro: [C: 03+1] "LGTM, any nits can be ignored." 
[cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/774385 (owner: 10Arturo Borrero Gonzalez) [09:39:16] (03PS3) 10Jelto: gitlab: move systemd interval for backup and restore to hiera [puppet] - 10https://gerrit.wikimedia.org/r/774416 (https://phabricator.wikimedia.org/T274463) [09:41:46] (03CR) 10Jcrespo: mediabackup::storage: Switch to systemd::sysuser (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/771560 (owner: 10Muehlenhoff) [09:42:32] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [09:42:41] (03PS40) 10Btullis: Add helm charts and a helmfile configuration for datahub [deployment-charts] - 10https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454) [09:43:01] !log mmandere@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp2033.codfw.wmnet with reason: host reimage [09:43:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:20] (03CR) 10David Caro: [C: 03+2] "recheck" [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/773743 (owner: 10Majavah) [09:44:00] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34580/console" [puppet] - 10https://gerrit.wikimedia.org/r/774416 (https://phabricator.wikimedia.org/T274463) (owner: 10Jelto) [09:45:40] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp2033.codfw.wmnet with reason: host reimage [09:45:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:29] !log installing Linux 4.9.303 on Stretch hosts [09:46:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:52] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [09:46:54] !log 
marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143', diff saved to https://phabricator.wikimedia.org/P23384 and previous config saved to /var/cache/conftool/dbconfig/20220328-094653-marostegui.json [09:46:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:02] (03PS1) 10Jelto: gitlab: run backup and restore twice daily [puppet] - 10https://gerrit.wikimedia.org/r/774418 (https://phabricator.wikimedia.org/T274463) [09:48:04] (03Abandoned) 10David Caro: discovery: remove unneeded protected-access supression [cookbooks] - 10https://gerrit.wikimedia.org/r/773744 (owner: 10David Caro) [09:48:12] (03PS3) 10David Caro: wmcs: toolforge: k8s: show output of deploy.sh [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/773743 (owner: 10Majavah) [09:49:56] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34581/console" [puppet] - 10https://gerrit.wikimedia.org/r/774418 (https://phabricator.wikimedia.org/T274463) (owner: 10Jelto) [09:50:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1169 (re)pooling @ 25%: After schema change', diff saved to https://phabricator.wikimedia.org/P23385 and previous config saved to /var/cache/conftool/dbconfig/20220328-095007-root.json [09:50:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:07] 10SRE, 10ops-eqiad: Eqiad: asw2-a-eqiad:xe-2/0/40 interface up with no description - https://phabricator.wikimedia.org/T304807 (10ayounsi) I used that opportunity to have LibreNMS open task for this kind of alert automatically instead of sending emails: T304818 T304819 T304822 Unfortunately it's not yet possi... 
[09:53:59] PROBLEM - SSH on ml-serve-ctrl1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:55:04] (03PS1) 10Giuseppe Lavagetto: puppetmaster::frontend: install requestctl [puppet] - 10https://gerrit.wikimedia.org/r/774420 (https://phabricator.wikimedia.org/T302471) [09:55:11] (03PS1) 10Giuseppe Lavagetto: conftool: remove request* objects from the schema [puppet] - 10https://gerrit.wikimedia.org/r/774421 (https://phabricator.wikimedia.org/T302471) [09:55:15] (03PS1) 10Giuseppe Lavagetto: conftool: remove request* objects from sync [puppet] - 10https://gerrit.wikimedia.org/r/774422 (https://phabricator.wikimedia.org/T302471) [09:56:01] (03CR) 10jerkins-bot: [V: 04-1] puppetmaster::frontend: install requestctl [puppet] - 10https://gerrit.wikimedia.org/r/774420 (https://phabricator.wikimedia.org/T302471) (owner: 10Giuseppe Lavagetto) [09:56:58] (KubernetesCalicoDown) firing: ml-serve-ctrl1001.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [09:57:07] RECOVERY - SSH on ml-serve-ctrl1001 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:57:31] (03PS2) 10Giuseppe Lavagetto: puppetmaster::frontend: install requestctl [puppet] - 10https://gerrit.wikimedia.org/r/774420 (https://phabricator.wikimedia.org/T302471) [09:57:45] (03PS3) 10Giuseppe Lavagetto: puppetmaster::frontend: install requestctl [puppet] - 10https://gerrit.wikimedia.org/r/774420 (https://phabricator.wikimedia.org/T302471) [09:58:45] (JobUnavailable) firing: (2) Reduced availability for job k8s-api in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:58:59] (03PS1) 10Muehlenhoff: Stop using 
profile::base::linux419 on Hadoop nodes [puppet] - 10https://gerrit.wikimedia.org/r/774423 [09:59:44] (03PS6) 10Elukey: WIP - Experiment with istio-cni plugin configs [puppet] - 10https://gerrit.wikimedia.org/r/773185 [09:59:46] (03PS1) 10Elukey: profile::calico::kubernetes: add optional istio-cni config [puppet] - 10https://gerrit.wikimedia.org/r/774424 (https://phabricator.wikimedia.org/T297612) [09:59:48] (03CR) 10Giuseppe Lavagetto: [C: 03+2] puppetmaster::frontend: install requestctl [puppet] - 10https://gerrit.wikimedia.org/r/774420 (https://phabricator.wikimedia.org/T302471) (owner: 10Giuseppe Lavagetto) [10:00:01] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] wmcs: toolforge: grid: add a cookbook to remove queue errors (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/774385 (owner: 10Arturo Borrero Gonzalez) [10:00:07] (03PS4) 10Arturo Borrero Gonzalez: wmcs: toolforge: grid: add a cookbook to remove queue errors [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/774385 [10:00:51] PROBLEM - SSH on ml-serve-ctrl1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [10:01:09] (03PS2) 10Elukey: profile::calico::kubernetes: add optional istio-cni config [puppet] - 10https://gerrit.wikimedia.org/r/774424 (https://phabricator.wikimedia.org/T297612) [10:01:18] (03PS7) 10Elukey: WIP - Experiment with istio-cni plugin configs [puppet] - 10https://gerrit.wikimedia.org/r/773185 [10:01:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143', diff saved to https://phabricator.wikimedia.org/P23386 and previous config saved to /var/cache/conftool/dbconfig/20220328-100159-marostegui.json [10:02:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:25] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/774423 (owner: 10Muehlenhoff) [10:03:25] RECOVERY - SSH on ml-serve-ctrl1001 is OK: SSH OK - 
OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [10:05:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1169 (re)pooling @ 50%: After schema change', diff saved to https://phabricator.wikimedia.org/P23387 and previous config saved to /var/cache/conftool/dbconfig/20220328-100511-root.json [10:05:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:06:55] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [10:10:59] PROBLEM - SSH on ml-serve-ctrl1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [10:13:06] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp2033.codfw.wmnet with OS buster [10:13:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:14] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp2033.codfw.wmnet with OS buster com... 
[10:14:27] (03PS3) 10Elukey: profile::calico::kubernetes: add optional istio-cni config [puppet] - 10https://gerrit.wikimedia.org/r/774424 (https://phabricator.wikimedia.org/T297612) [10:14:29] (03PS8) 10Elukey: WIP - Experiment with istio-cni plugin configs [puppet] - 10https://gerrit.wikimedia.org/r/773185 [10:15:21] (03PS34) 10Elukey: Refactor Calico's CNI plugin config [puppet] - 10https://gerrit.wikimedia.org/r/772909 (https://phabricator.wikimedia.org/T297612) [10:15:28] (03PS4) 10Elukey: profile::calico::kubernetes: add optional istio-cni config [puppet] - 10https://gerrit.wikimedia.org/r/774424 (https://phabricator.wikimedia.org/T297612) [10:15:35] (03PS9) 10Elukey: WIP - Experiment with istio-cni plugin configs [puppet] - 10https://gerrit.wikimedia.org/r/773185 [10:15:51] .7 [10:15:55] uff [10:17:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143 (T298556)', diff saved to https://phabricator.wikimedia.org/P23389 and previous config saved to /var/cache/conftool/dbconfig/20220328-101704-marostegui.json [10:17:06] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1144.eqiad.wmnet with reason: Maintenance [10:17:07] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1144.eqiad.wmnet with reason: Maintenance [10:17:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:10] T298556: Fix mismatching field type of oldimage.oi_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298556 [10:17:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3314 (T298556)', diff saved to https://phabricator.wikimedia.org/P23390 and previous config saved to /var/cache/conftool/dbconfig/20220328-101712-marostegui.json [10:17:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:16] !log pool 
cp2033 with HAProxy as TLS termination layer - T290005 [10:17:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:24] (03CR) 10JMeybohm: [C: 04-1] Add helm charts and a helmfile configuration for datahub (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454) (owner: 10Btullis) [10:17:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:25] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [10:17:55] RECOVERY - SSH on ml-serve-ctrl1001 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [10:18:45] PROBLEM - SSH on aqs1009.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:18:45] (JobUnavailable) firing: (2) Reduced availability for job k8s-api in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:20:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1169 (re)pooling @ 75%: After schema change', diff saved to https://phabricator.wikimedia.org/P23391 and previous config saved to /var/cache/conftool/dbconfig/20220328-102014-root.json [10:20:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:21:58] (KubernetesCalicoDown) resolved: ml-serve-ctrl1001.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [10:25:01] (03PS1) 10Giuseppe Lavagetto: Downgrade to use pyparsing 2.x, default version in buster/bullseye [software/conftool] - 10https://gerrit.wikimedia.org/r/774431 [10:25:03] (03PS1) 10Giuseppe Lavagetto: 
Version bump [software/conftool] - 10https://gerrit.wikimedia.org/r/774432 [10:29:09] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1169.eqiad.wmnet with reason: Maintenance [10:29:11] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1169.eqiad.wmnet with reason: Maintenance [10:29:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1169 (T300775)', diff saved to https://phabricator.wikimedia.org/P23392 and previous config saved to /var/cache/conftool/dbconfig/20220328-102915-marostegui.json [10:29:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:22] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775 [10:33:51] (03PS3) 10Btullis: Add an alert for zero messages being generated by varnishkafka instances [alerts] - 10https://gerrit.wikimedia.org/r/773801 (https://phabricator.wikimedia.org/T300246) [10:34:16] (03CR) 10Btullis: Add an alert for zero messages being generated by varnishkafka instances (035 comments) [alerts] - 10https://gerrit.wikimedia.org/r/773801 (https://phabricator.wikimedia.org/T300246) (owner: 10Btullis) [10:37:06] (03CR) 10jerkins-bot: [V: 04-1] Add an alert for zero messages being generated by varnishkafka instances [alerts] - 10https://gerrit.wikimedia.org/r/773801 (https://phabricator.wikimedia.org/T300246) (owner: 10Btullis) [10:37:47] 10SRE-tools, 10DBA, 10Infrastructure-Foundations, 10Patch-For-Review, and 2 others: Create or modify an existing tool that quickly shows the db replication status in case of master failure - https://phabricator.wikimedia.org/T281249 (10Ladsgroup) >>! In T281249#7796947, @Marostegui wrote: > That's the main... 
[10:38:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T298556)', diff saved to https://phabricator.wikimedia.org/P23393 and previous config saved to /var/cache/conftool/dbconfig/20220328-103828-marostegui.json [10:38:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:35] T298556: Fix mismatching field type of oldimage.oi_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298556 [10:53:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P23394 and previous config saved to /var/cache/conftool/dbconfig/20220328-105333-marostegui.json [10:53:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:53:38] (03PS1) 10Ayounsi: Alertmanager: route alerts to site phab task [puppet] - 10https://gerrit.wikimedia.org/r/774437 (https://phabricator.wikimedia.org/T300836) [11:01:42] (03PS2) 10Ayounsi: Alertmanager: route DCops task alerts to sites project [puppet] - 10https://gerrit.wikimedia.org/r/774437 (https://phabricator.wikimedia.org/T300836) [11:04:57] (03PS3) 10Ayounsi: Alertmanager: route DCops task alerts to sites project [puppet] - 10https://gerrit.wikimedia.org/r/774437 (https://phabricator.wikimedia.org/T300836) [11:05:58] (03PS2) 10Giuseppe Lavagetto: Downgrade to use pyparsing 2.x, default version in buster/bullseye [software/conftool] - 10https://gerrit.wikimedia.org/r/774431 [11:06:00] (03PS2) 10Giuseppe Lavagetto: Version bump [software/conftool] - 10https://gerrit.wikimedia.org/r/774432 [11:06:13] (KubernetesRsyslogDown) firing: (2) rsyslog on ml-serve1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [11:07:33] <_joe_> elukey: ^^ seen this ongoing since yesterday [11:07:52] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Downgrade to use 
pyparsing 2.x, default version in buster/bullseye [software/conftool] - 10https://gerrit.wikimedia.org/r/774431 (owner: 10Giuseppe Lavagetto) [11:08:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P23395 and previous config saved to /var/cache/conftool/dbconfig/20220328-110839-marostegui.json [11:08:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:41] (03Merged) 10jenkins-bot: Downgrade to use pyparsing 2.x, default version in buster/bullseye [software/conftool] - 10https://gerrit.wikimedia.org/r/774431 (owner: 10Giuseppe Lavagetto) [11:12:16] (03CR) 10Klausman: [C: 03+1] Refactor Calico's CNI plugin config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/772909 (https://phabricator.wikimedia.org/T297612) (owner: 10Elukey) [11:12:57] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Version bump [software/conftool] - 10https://gerrit.wikimedia.org/r/774432 (owner: 10Giuseppe Lavagetto) [11:14:05] (03CR) 10Klausman: [C: 03+1] Add helmfile config for Istio proxy sidecars (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/773565 (https://phabricator.wikimedia.org/T297612) (owner: 10Elukey) [11:15:11] (03PS4) 10Ayounsi: Alertmanager: route DCops task alerts to sites project [puppet] - 10https://gerrit.wikimedia.org/r/774437 (https://phabricator.wikimedia.org/T300836) [11:15:26] (03Merged) 10jenkins-bot: Version bump [software/conftool] - 10https://gerrit.wikimedia.org/r/774432 (owner: 10Giuseppe Lavagetto) [11:17:26] (03CR) 10jerkins-bot: [V: 04-1] Localisation updates from https://translatewiki.net. 
[software/mailman-templates] - 10https://gerrit.wikimedia.org/r/774446 (owner: 10L10n-bot) [11:19:47] RECOVERY - SSH on aqs1009.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:21:45] (03PS1) 10David Caro: openstack.cinder: Add patch to the backups chunkeddriver [puppet] - 10https://gerrit.wikimedia.org/r/774454 (https://phabricator.wikimedia.org/T302855) [11:23:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T298556)', diff saved to https://phabricator.wikimedia.org/P23396 and previous config saved to /var/cache/conftool/dbconfig/20220328-112345-marostegui.json [11:23:46] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1147.eqiad.wmnet with reason: Maintenance [11:23:48] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1147.eqiad.wmnet with reason: Maintenance [11:23:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:50] T298556: Fix mismatching field type of oldimage.oi_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298556 [11:23:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1147 (T298556)', diff saved to https://phabricator.wikimedia.org/P23397 and previous config saved to /var/cache/conftool/dbconfig/20220328-112352-marostegui.json [11:23:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:17] 10SRE, 10Infrastructure-Foundations: Integrate Buster 10.12 point update - https://phabricator.wikimedia.org/T304546 (10MoritzMuehlenhoff) [11:25:13] !log installing Intel microcode updates 2022-02-07 on Bullseye [11:25:16] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:30] (03CR) 10Arturo Borrero Gonzalez: openstack.cinder: Add patch to the backups chunkeddriver (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/774454 (https://phabricator.wikimedia.org/T302855) (owner: 10David Caro) [11:30:57] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host sretest1001.eqiad.wmnet [11:31:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:45] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1001.eqiad.wmnet [11:35:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:35] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet [11:36:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:07] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet [11:40:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:44:46] !log depool cp2031 for reimage - T290005 [11:44:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:44:51] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [11:44:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147 (T298556)', diff saved to https://phabricator.wikimedia.org/P23398 and previous config saved to /var/cache/conftool/dbconfig/20220328-114451-marostegui.json [11:44:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:44:56] T298556: Fix mismatching field type of oldimage.oi_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298556 [11:47:57] (03CR) 10Phedenskog: grafana: provision JSON datasource (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/774380 (https://phabricator.wikimedia.org/T304583) (owner: 
10Phedenskog) [11:50:14] (03PS3) 10MMandere: site: Reimage cp2031 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/773251 (https://phabricator.wikimedia.org/T290005) [11:51:36] (03PS1) 10Btullis: Use test coordinator for staging datahub deploy [puppet] - 10https://gerrit.wikimedia.org/r/774458 (https://phabricator.wikimedia.org/T301459) [11:52:56] (03CR) 10MMandere: [C: 03+2] site: Reimage cp2031 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/773251 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere) [11:55:24] !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host cp2031.codfw.wmnet with OS buster [11:55:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:32] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp2031.codfw.wmnet with OS buster [11:56:19] (03PS2) 10Arturo Borrero Gonzalez: wmcs: toolforge: introduce cookbook to build/deploy all k8s components [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/773612 [11:56:26] (03PS1) 10Arturo Borrero Gonzalez: wmcs: toolforge: k8s: factorize build/deplo code into a manager class [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/774459 [11:57:42] (03Abandoned) 10Arturo Borrero Gonzalez: wmcs: toolforge: k8s: factorize deploy code into a class [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/773509 (owner: 10Arturo Borrero Gonzalez) [11:57:52] (03Abandoned) 10Arturo Borrero Gonzalez: wmcs: toolforge: k8s: factorize build code into a class [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/773510 (owner: 10Arturo Borrero Gonzalez) [11:59:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147', diff saved to https://phabricator.wikimedia.org/P23399 and 
previous config saved to /var/cache/conftool/dbconfig/20220328-115956-marostegui.json [12:00:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:10] PROBLEM - Not enough idle PHP-FPM workers for Mediawiki api_appserver at eqiad #page on alert1001 is CRITICAL: 0.2196 lt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver [12:08:34] here [12:08:43] here [12:09:17] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.7903 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [12:09:37] <_joe_> sigh [12:09:45] <_joe_> if I had to guess, it's changeprop again [12:10:21] <_joe_> yes we had a peak https://grafana.wikimedia.org/d/000300/change-propagation?orgId=1&refresh=1m&viewPanel=28 [12:10:36] <_joe_> see the "transcludes" job [12:10:47] looks to be gently going back down again [12:13:22] what is changeprop here? [12:13:34] <_joe_> Emperor: can I answer later? [12:13:39] <_joe_> I was in the middle of lunch [12:13:48] !log mmandere@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp2031.codfw.wmnet with reason: host reimage [12:13:50] <_joe_> jayme: can you check how to reduce concurrency on changeprop? 
[12:13:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:59] _joe_: surely [12:14:17] there is a similar but not that high spike in mobileapps req/s [12:14:18] <_joe_> yeah the current instance will resolve soon [12:14:32] <_joe_> jayme: yes it's the same problem we saw yesterday [12:14:40] RECOVERY - Not enough idle PHP-FPM workers for Mediawiki api_appserver at eqiad #page on alert1001 is OK: (C)0.3 lt (W)0.5 lt 0.5244 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver [12:14:43] ack. wanted to confirm [12:14:44] <_joe_> can you look into reducing concurrency for transcludes in changeprop? [12:14:49] _joe_: I can check, yes [12:14:50] <_joe_> I'll eat in the meantime :) [12:15:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147', diff saved to https://phabricator.wikimedia.org/P23400 and previous config saved to /var/cache/conftool/dbconfig/20220328-121501-marostegui.json [12:15:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:49] changeprop config ... 
keeps haunting me [12:16:27] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp2031.codfw.wmnet with reason: host reimage [12:16:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:57] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.3065 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [12:24:17] (PS1) JMeybohm: Half the transclusion update concurrecy [deployment-charts] - https://gerrit.wikimedia.org/r/774462 [12:26:28] _joe_: --^ when you're back [12:26:52] (CR) Giuseppe Lavagetto: [C: +1] Half the transclusion update concurrecy [deployment-charts] - https://gerrit.wikimedia.org/r/774462 (owner: JMeybohm) [12:27:12] <_joe_> I literally sat down seconds after you sent your patch [12:29:08] (CR) JMeybohm: [C: +2] Half the transclusion update concurrecy [deployment-charts] - https://gerrit.wikimedia.org/r/774462 (owner: JMeybohm) [12:29:11] great :) [12:30:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147 (T298556)', diff saved to https://phabricator.wikimedia.org/P23401 and previous config saved to /var/cache/conftool/dbconfig/20220328-123007-marostegui.json [12:30:08] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1148.eqiad.wmnet with reason: Maintenance [12:30:10] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1148.eqiad.wmnet with reason: Maintenance [12:30:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:13] T298556: Fix mismatching field type of oldimage.oi_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298556 [12:30:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1148 (T298556)', diff saved to 
https://phabricator.wikimedia.org/P23402 and previous config saved to /var/cache/conftool/dbconfig/20220328-123015-marostegui.json [12:30:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:31:56] !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/changeprop: apply [12:31:59] (CR) Ayounsi: "https://puppet-compiler.wmflabs.org/pcc-worker1003/34585/alert1001.wikimedia.org/index.html" [puppet] - https://gerrit.wikimedia.org/r/774437 (https://phabricator.wikimedia.org/T300836) (owner: Ayounsi) [12:31:59] !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/changeprop: apply [12:32:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:57] ETOOFASTFORCI [12:34:06] (Merged) jenkins-bot: Half the transclusion update concurrecy [deployment-charts] - https://gerrit.wikimedia.org/r/774462 (owner: JMeybohm) [12:34:28] !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/changeprop: apply [12:34:31] !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/changeprop: apply [12:34:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:29] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop: apply [12:36:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:55] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop: apply [12:36:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:44] SRE, Infrastructure-Foundations, netops: IPv6 BFD Sessions Failing from Bird (Anycast 
VMs) to Juniper QFX in drmrs - https://phabricator.wikimedia.org/T304501 (ayounsi) thanks for documenting it, and yes, I fully agree. We have BGP configured to the core-routers loopback in many different locations... [12:38:09] (PS1) Urbanecm: throttle: Add rule for Czech Wikigap 2022 [mediawiki-config] - https://gerrit.wikimedia.org/r/774466 (https://phabricator.wikimedia.org/T304836) [12:38:21] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp2031.codfw.wmnet with OS buster [12:38:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:26] jouncebot: nowandnext [12:38:26] No deployments scheduled for the next 0 hour(s) and 21 minute(s) [12:38:26] In 0 hour(s) and 21 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220328T1300) [12:38:29] SRE, Traffic, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp2031.codfw.wmnet with OS buster com... 
[12:38:36] deploying the above, should be quick [12:38:45] (CR) Urbanecm: [C: +2] throttle: Add rule for Czech Wikigap 2022 [mediawiki-config] - https://gerrit.wikimedia.org/r/774466 (https://phabricator.wikimedia.org/T304836) (owner: Urbanecm) [12:39:31] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop: apply [12:39:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:36] (Merged) jenkins-bot: throttle: Add rule for Czech Wikigap 2022 [mediawiki-config] - https://gerrit.wikimedia.org/r/774466 (https://phabricator.wikimedia.org/T304836) (owner: Urbanecm) [12:40:23] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop: apply [12:40:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:59] !log urbanecm@deploy1002 Synchronized wmf-config/throttle.php: 3ba524ddc4eb4f719c82064dc3ffb5872fd9c941: throttle: Add rule for Czech Wikigap 2022 (T304836) (duration: 00m 52s) [12:41:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:05] T304836: IP throttle lift request for Czech Wikigap 2022 in Brno - https://phabricator.wikimedia.org/T304836 [12:41:29] (CR) Jcrespo: [C: +2] mediabackup: Update backup of testwiki media on codfw [puppet] - https://gerrit.wikimedia.org/r/774378 (https://phabricator.wikimedia.org/T299764) (owner: Jcrespo) [12:41:45] * urbanecm done [12:43:51] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [12:43:53] !log Clear signup authentication throttle per https://wikitech.wikimedia.org/wiki/Increasing_account_creation_threshold for 195.113.155.4 (T304836) [12:43:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:00] !log pool cp2031 with HAProxy as TLS termination layer - T290005 [12:44:04] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:04] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [12:44:06] (CR) Abijeet Patro: [V: +2] Localisation updates from https://translatewiki.net. [software/mailman-templates] - https://gerrit.wikimedia.org/r/774446 (owner: L10n-bot) [12:44:25] !log installing Intel microcode updates 2022-02-07 on Buster [12:44:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:50] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [12:44:51] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [12:44:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:18] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [12:47:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:48:41] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is OK: All metrics within thresholds. 
https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [12:50:09] (PS4) Btullis: Add an alert for zero messages being generated by varnishkafka instances [alerts] - https://gerrit.wikimedia.org/r/773801 (https://phabricator.wikimedia.org/T300246) [12:50:24] !log depool cp2029 for reimage - T290005 [12:50:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:29] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [12:51:13] (CR) Elukey: Add helmfile config for Istio proxy sidecars (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/773565 (https://phabricator.wikimedia.org/T297612) (owner: Elukey) [12:52:04] (PS3) MMandere: site: Reimage cp2029 as cache::text_haproxy [puppet] - https://gerrit.wikimedia.org/r/773252 (https://phabricator.wikimedia.org/T290005) [12:52:25] (PS35) Elukey: Refactor Calico's CNI plugin config [puppet] - https://gerrit.wikimedia.org/r/772909 (https://phabricator.wikimedia.org/T297612) [12:52:43] (CR) Elukey: Refactor Calico's CNI plugin config (1 comment) [puppet] - https://gerrit.wikimedia.org/r/772909 (https://phabricator.wikimedia.org/T297612) (owner: Elukey) [12:55:32] (CR) MMandere: [C: +2] site: Reimage cp2029 as cache::text_haproxy [puppet] - https://gerrit.wikimedia.org/r/773252 (https://phabricator.wikimedia.org/T290005) (owner: MMandere) [12:56:01] (CR) Kevin Bazira: [C: +1] "LGTM" [deployment-charts] - https://gerrit.wikimedia.org/r/772811 (https://phabricator.wikimedia.org/T300270) (owner: AikoChou) [12:57:27] !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host cp2029.codfw.wmnet with OS buster [12:57:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:35] SRE, Traffic, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS 
terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp2029.codfw.wmnet with OS buster [12:57:40] (CR) Joal: [C: +1] "LGTM in term of webrequest schema - VCL code is out of scope for me :)" [puppet] - https://gerrit.wikimedia.org/r/765485 (https://phabricator.wikimedia.org/T301238) (owner: Phuedx) [13:00:03] PROBLEM - SSH on db2090.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:00:04] RoanKattouw, Lucas_WMDE, and Urbanecm: Dear deployers, time to do the UTC afternoon backport window deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220328T1300). [13:00:05] Lucas_WMDE, DannyS712, koi, and zabe: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:10] * urbanecm around [13:00:17] o/ [13:00:19] I can deploy [13:00:21] o/ [13:00:24] SRE, Infrastructure-Foundations: Integrate Buster 10.12 point update - https://phabricator.wikimedia.org/T304546 (MoritzMuehlenhoff) [13:00:30] Lucas_WMDE: go ahead -- I'm around if needed. 
[13:00:30] o/ [13:00:41] ok, thanks [13:00:49] SRE, Infrastructure-Foundations: Integrate Bullseye 11.3 point update - https://phabricator.wikimedia.org/T304599 (MoritzMuehlenhoff) [13:00:50] I see we have seven patches, can’t guarantee we’ll manage all of them [13:01:02] (I intend to still sync the phpcs cleanups) [13:01:26] (PS2) Lucas Werkmeister (WMDE): Write "unexpectedUnconnectedPage" page prop everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/773239 [13:01:46] probably can be synced all at once though (most of the files it changes aren't actually read by prod) [13:01:59] hm, maybe [13:02:46] (CR) Lucas Werkmeister (WMDE): [C: +2] Write "unexpectedUnconnectedPage" page prop everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/773239 (owner: Lucas Werkmeister (WMDE)) [13:03:36] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [13:03:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:06] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. 
[13:04:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:30] (Merged) jenkins-bot: Write "unexpectedUnconnectedPage" page prop everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/773239 (owner: Lucas Werkmeister (WMDE)) [13:05:28] seems to work, syncing [13:07:17] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:773239|Write "unexpectedUnconnectedPage" page prop everywhere]] (duration: 00m 56s) [13:07:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:32] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:07:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:25] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:08:26] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:08:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:08] (CR) Andrew Bogott: openstack.cinder: Add patch to the backups chunkeddriver (1 comment) [puppet] - https://gerrit.wikimedia.org/r/774454 (https://phabricator.wikimedia.org/T302855) (owner: David Caro) [13:09:20] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:09:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:05] (PS4) Lucas Werkmeister (WMDE): phpcs: enable and fix SingleSpaceBeforeSingleLineComment [mediawiki-config] - https://gerrit.wikimedia.org/r/773863 (https://phabricator.wikimedia.org/T171115) (owner: DannyS712) [13:10:12] (CR) Andrew Bogott: "I'd also appreciate having a link to the upstream patch in the puppet patch (even though it's already linked in the phab ticket)" [puppet] - 
https://gerrit.wikimedia.org/r/774454 (https://phabricator.wikimedia.org/T302855) (owner: David Caro) [13:10:53] (CR) Lucas Werkmeister (WMDE): [C: +2] "Rebased with a slight tweak to the number of asterisks so end of the line lines up :)" [mediawiki-config] - https://gerrit.wikimedia.org/r/773863 (https://phabricator.wikimedia.org/T171115) (owner: DannyS712) [13:11:47] (Merged) jenkins-bot: phpcs: enable and fix SingleSpaceBeforeSingleLineComment [mediawiki-config] - https://gerrit.wikimedia.org/r/773863 (https://phabricator.wikimedia.org/T171115) (owner: DannyS712) [13:13:17] DannyS712: are you around? [13:13:28] (I’m assuming there’s nothing to test, but I should still ask) [13:13:55] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings-labs.php: Config: [[gerrit:773863|phpcs: enable and fix SingleSpaceBeforeSingleLineComment (T171115)]] (phpcs.xml will be synced with next patch) (duration: 01m 01s) [13:13:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:00] T171115: Remove phpcs exceptions and severity 0 from mediawiki-config - https://phabricator.wikimedia.org/T171115 [13:14:06] (PS4) Lucas Werkmeister (WMDE): phpcs: enable passing rule UnusedGlobalVariables [mediawiki-config] - https://gerrit.wikimedia.org/r/773864 (https://phabricator.wikimedia.org/T171115) (owner: DannyS712) [13:14:17] (CR) Lucas Werkmeister (WMDE): [C: +2] phpcs: enable passing rule UnusedGlobalVariables [mediawiki-config] - https://gerrit.wikimedia.org/r/773864 (https://phabricator.wikimedia.org/T171115) (owner: DannyS712) [13:14:24] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:14:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:32] (Merged) jenkins-bot: phpcs: enable passing rule UnusedGlobalVariables [mediawiki-config] - https://gerrit.wikimedia.org/r/773864 
(https://phabricator.wikimedia.org/T171115) (owner: DannyS712) [13:15:49] !log mmandere@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp2029.codfw.wmnet with reason: host reimage [13:15:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:55] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:16:56] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:16:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:05] Lucas_WMDE: can you ping me when done deploying please? [13:17:09] ok [13:17:42] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:17:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:55] !log lucaswerkmeister-wmde@deploy1002 Synchronized phpcs.xml: Config: [[gerrit:773864|phpcs: enable passing rule UnusedGlobalVariables (T171115)]] (includes phpcs.xml change from previous sync) (duration: 00m 56s) [13:17:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:22] (PS5) Lucas Werkmeister (WMDE): phpcs: narrow some exclusions only needed for cirrusTest.php [mediawiki-config] - https://gerrit.wikimedia.org/r/773865 (https://phabricator.wikimedia.org/T171115) (owner: DannyS712) [13:18:25] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp2029.codfw.wmnet with reason: host reimage [13:18:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:43] (CR) Lucas Werkmeister (WMDE): [C: +2] phpcs: narrow some exclusions only needed for cirrusTest.php [mediawiki-config] - https://gerrit.wikimedia.org/r/773865 (https://phabricator.wikimedia.org/T171115) (owner: DannyS712) [13:19:03] (CR) Ayounsi: "A couple comments but 
overall lgtm! I also checked that BFD doesn't seem to use high ports as destination." [homer/public] - https://gerrit.wikimedia.org/r/773299 (https://phabricator.wikimedia.org/T304553) (owner: Cathal Mooney) [13:20:16] (Merged) jenkins-bot: phpcs: narrow some exclusions only needed for cirrusTest.php [mediawiki-config] - https://gerrit.wikimedia.org/r/773865 (https://phabricator.wikimedia.org/T171115) (owner: DannyS712) [13:21:52] (PS3) Cathal Mooney: Add template to configure IPv6 RAs on CRs and L3 Switches [homer/public] - https://gerrit.wikimedia.org/r/773587 (https://phabricator.wikimedia.org/T299758) [13:21:58] !log lucaswerkmeister-wmde@deploy1002 Synchronized tests/cirrusTest.php: Config: [[gerrit:773865|phpcs: narrow some exclusions only needed for cirrusTest.php (T171115)]] (1/2) (duration: 00m 56s) [13:22:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:03] T171115: Remove phpcs exceptions and severity 0 from mediawiki-config - https://phabricator.wikimedia.org/T171115 [13:22:15] (PS3) Lucas Werkmeister (WMDE): phpcs: clean up MWConfigCacheGenerator and enable rules [mediawiki-config] - https://gerrit.wikimedia.org/r/773966 (https://phabricator.wikimedia.org/T171115) (owner: DannyS712) [13:22:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:22:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:03] !log lucaswerkmeister-wmde@deploy1002 Synchronized phpcs.xml: Config: [[gerrit:773865|phpcs: narrow some exclusions only needed for cirrusTest.php (T171115)]] (2/2) (duration: 00m 55s) [13:23:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:26] (CR) Cathal Mooney: "Thanks for the feedback @ayounsi I've made those changes now." 
[homer/public] - https://gerrit.wikimedia.org/r/773587 (https://phabricator.wikimedia.org/T299758) (owner: Cathal Mooney) [13:23:29] SRE-tools, DBA, Infrastructure-Foundations, Patch-For-Review, and 2 others: Create or modify an existing tool that quickly shows the db replication status in case of master failure - https://phabricator.wikimedia.org/T281249 (Marostegui) >>! In T281249#7810230, @Ladsgroup wrote: > > Thanks. I wi... [13:24:17] I’ll skip DannyS712’s last change for now, I don’t want to verify right now whether all those types are correct [13:24:29] let’s continue with koi [13:24:39] ok [13:25:21] urbanecm: just to check, do throttling exceptions need some kind of approval on Phabricator or is it enough if it looks good to me? [13:25:29] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:25:30] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:25:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:43] Lucas_WMDE: no need for an approval process, assuming the event is organized by somewhat-trusted user [13:25:50] ok thanks [13:26:02] (CR) Ayounsi: [C: +1] "1 nit, LGTM otherwise!" 
[homer/public] - https://gerrit.wikimedia.org/r/773587 (https://phabricator.wikimedia.org/T299758) (owner: Cathal Mooney) [13:26:22] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:26:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:29] EditingOedipa was already granted account creator temporarily a few years ago, I think that’s good enough :) [13:26:36] yup [13:27:27] (CR) Lucas Werkmeister (WMDE): Throttle: Add rule for Bard College class project on enwiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/774023 (https://phabricator.wikimedia.org/T304687) (owner: Stang) [13:27:32] hm, found one mismatch though [13:27:43] I’ll check the subnet/mask format whether /23 or /24 seems more correct [13:28:21] (CR) Lucas Werkmeister (WMDE): Throttle: Add rule for Bard College class project on enwiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/774023 (https://phabricator.wikimedia.org/T304687) (owner: Stang) [13:28:26] nevermind [13:29:05] (CR) Stang: Throttle: Add rule for Bard College class project on enwiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/774023 (https://phabricator.wikimedia.org/T304687) (owner: Stang) [13:29:19] but now Gerrit says it has a merge conflict during rebase -.- [13:29:24] koi: do you want to resolve it or should I? [13:29:47] please tell me how to do so, I would like to :) [13:29:56] okay [13:30:14] you have the change in a local clone of mediawiki-config.git, right? 
[13:30:22] yep [13:30:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148 (T298556)', diff saved to https://phabricator.wikimedia.org/P23403 and previous config saved to /var/cache/conftool/dbconfig/20220328-133029-marostegui.json [13:30:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:35] T298556: Fix mismatching field type of oldimage.oi_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298556 [13:30:35] then you need to switch to the master branch and run git pull [13:31:06] (`git pull --rebase`, to be on the safe side, maybe) [13:31:14] got it, doing [13:31:27] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:31:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:26] Lucas_WMDE, done, and then? [13:32:26] (PS1) Jcrespo: mediabackup: Update s4 backup in codfw [puppet] - https://gerrit.wikimedia.org/r/774474 (https://phabricator.wikimedia.org/T299764) [13:32:40] if you had the change on a separate branch, then switch back to that branch and `git rebase master` [13:33:01] if you had the change on master (I realized after my earlier message that this might be an option), then the rebase should already be done [13:34:15] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:34:17] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:34:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:46] (PS4) Cathal Mooney: Add template to configure IPv6 RAs on CRs and L3 Switches [homer/public] - https://gerrit.wikimedia.org/r/773587 (https://phabricator.wikimedia.org/T299758) [13:34:47] (I tried it out on my end and it looks like there are no actual conflicts) [13:34:50] wait, this should be done in the branch 
after I run `git review -d [id]`? [13:34:59] or the normal branch [13:35:07] I’m not sure [13:35:14] if you ran `git review -d`, then probably in that branch [13:35:15] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:35:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:36] I assumed you wouldn’t need that command since you’d already have the change locally, but it shouldn’t hurt either [13:35:37] There is https://www.mediawiki.org/wiki/Gerrit/Advanced_usage#Manually_rebase_(on_a_branch) ;) [13:36:01] (CR) Ayounsi: [C: +1] Add template to configure IPv6 RAs on CRs and L3 Switches [homer/public] - https://gerrit.wikimedia.org/r/773587 (https://phabricator.wikimedia.org/T299758) (owner: Cathal Mooney) [13:37:53] (CR) Jcrespo: [C: +2] mediabackup: Update s4 backup in codfw [puppet] - https://gerrit.wikimedia.org/r/774474 (https://phabricator.wikimedia.org/T299764) (owner: Jcrespo) [13:38:37] 0 0 I give up [13:38:58] it said "It seems that there is already a rebase-apply directory, and I wonder if you are in the middle of another rebase. 
" [13:39:08] o_O [13:39:14] okay, I can push the rebase [13:39:29] and you can probably run `git rebase --abort` to discard whatever is currently going on [13:39:39] (PS4) Lucas Werkmeister (WMDE): Migrate $wmfAllServices to $wmgAllServices [mediawiki-config] - https://gerrit.wikimedia.org/r/773608 (https://phabricator.wikimedia.org/T45956) (owner: Zabe) [13:39:41] thanks a lot, rebase kind of hard to me 0 0 [13:40:14] (CR) Lucas Werkmeister (WMDE): [C: +2] Throttle: Add rule for Bard College class project on enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/774023 (https://phabricator.wikimedia.org/T304687) (owner: Stang) [13:40:16] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp2029.codfw.wmnet with OS buster [13:40:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:27] SRE, Traffic, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp2029.codfw.wmnet with OS buster com... 
[13:40:56] bah, my local master wasn’t actually latest either [13:41:00] let’s see if it still merges [13:41:01] (CR) jerkins-bot: [V: -1] Throttle: Add rule for Bard College class project on enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/774023 (https://phabricator.wikimedia.org/T304687) (owner: Stang) [13:41:33] wait, I think I rebased the wrong change [13:41:35] sorry [13:41:47] * Lucas_WMDE starts from scratch [13:41:56] ok, there is an actual conflict [13:42:21] (PS2) Lucas Werkmeister (WMDE): Throttle: Add rule for Bard College class project on enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/774023 (https://phabricator.wikimedia.org/T304687) (owner: Stang) [13:42:36] yeah I see urbanecm add another throttle rule today [13:42:44] (CR) Lucas Werkmeister (WMDE): [C: +2] "Sorry, I rebased the wrong change earlier. Should be good to go now." [mediawiki-config] - https://gerrit.wikimedia.org/r/774023 (https://phabricator.wikimedia.org/T304687) (owner: Stang) [13:42:53] sorry for making changes conflict :) [13:43:41] (Merged) jenkins-bot: Throttle: Add rule for Bard College class project on enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/774023 (https://phabricator.wikimedia.org/T304687) (owner: Stang) [13:44:02] alright, the change is on mwdebug1001 now [13:44:09] koi: do you know if it can be tested? [13:44:17] just wonder how to test this [13:44:32] I feel like it can’t really be tested [13:44:40] except checking that nothing obvious breaks [13:45:24] poke urbanecm, how to test a throttle rule here? 
[13:45:30] koi: you can't [13:45:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148', diff saved to https://phabricator.wikimedia.org/P23404 and previous config saved to /var/cache/conftool/dbconfig/20220328-134534-marostegui.json [13:45:37] ok, then I’ll just sync [13:45:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:41] ok [13:45:51] you can only test it in the set timeframe [13:45:54] Lucas_WMDE sorry I'm late, forgot to set a reminder for this [13:45:59] (and from the IP) [13:46:09] I'm here if my PHPCS patches can still be deployed [13:46:17] I deployed the first three [13:46:37] I think the last one (param types) should be reviewed (+1) outside of the window, I didn’t want to spend time looking into the correctness there [13:46:44] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/throttle.php: Config: [[gerrit:774023|Throttle: Add rule for Bard College class project on enwiki (T304687)]] (duration: 00m 54s) [13:46:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:49] T304687: IP Cap Lift March 31, 2022 - https://phabricator.wikimedia.org/T304687 [13:47:06] (PS5) Lucas Werkmeister (WMDE): Migrate $wmfAllServices to $wmgAllServices [mediawiki-config] - https://gerrit.wikimedia.org/r/773608 (https://phabricator.wikimedia.org/T45956) (owner: Zabe) [13:47:19] ah, okay, thanks. 
For the param types I can spend some time documenting where I got each of the actual types from if needed [13:47:28] that sounds great [13:49:02] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Migrate $wmfAllServices to $wmgAllServices [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773608 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [13:49:44] (03Merged) 10jenkins-bot: Migrate $wmfAllServices to $wmgAllServices [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773608 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [13:50:11] zabe: the wmgAllServices change is on mwdebug1001, please test it [13:50:20] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:50:21] (03PS4) 10JMeybohm: Allow to specify additional gateway hosts without overriding the default [deployment-charts] - 10https://gerrit.wikimedia.org/r/773805 (https://phabricator.wikimedia.org/T290966) [13:50:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:19] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:51:20] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:51:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:01] Lucas_WMDE, nothing seems to break and logstash is clear, lgtm [13:52:11] alright, thanks [13:52:15] (03PS1) 10Jcrespo: mariabackup: Prefer db2099 mw db (backup source) for mediabackups [puppet] - 10https://gerrit.wikimedia.org/r/774477 (https://phabricator.wikimedia.org/T299764) [13:52:16] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:52:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:42] (03PS5) 10Ayounsi: Added optional ability to enable uRPF filtering on arbitary CR ints [homer/public] - 
10https://gerrit.wikimedia.org/r/702446 (https://phabricator.wikimedia.org/T285461) (owner: 10Cathal Mooney) [13:53:42] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/CirrusSearch-production.php: Config: [[gerrit:773608|Migrate $wmfAllServices to $wmgAllServices (T45956)]] (1/5) (duration: 00m 51s) [13:53:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:48] T45956: Rename $wmf* to $wmg* in wmf-config - https://phabricator.wikimedia.org/T45956 [13:54:56] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/CommonSettings.php: Config: [[gerrit:773608|Migrate $wmfAllServices to $wmgAllServices (T45956)]] (2/5) (duration: 00m 56s) [13:55:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:05] (03PS1) 10Ayounsi: Apply strict uRPF to the cloud-hosts vlan [homer/public] - 10https://gerrit.wikimedia.org/r/774478 (https://phabricator.wikimedia.org/T285461) [13:55:07] (03PS1) 10Ayounsi: Apply strict uRPF to the analytics vlans [homer/public] - 10https://gerrit.wikimedia.org/r/774479 (https://phabricator.wikimedia.org/T298087) [13:55:48] (03CR) 10jerkins-bot: [V: 04-1] Apply strict uRPF to the analytics vlans [homer/public] - 10https://gerrit.wikimedia.org/r/774479 (https://phabricator.wikimedia.org/T298087) (owner: 10Ayounsi) [13:55:55] (03CR) 10jerkins-bot: [V: 04-1] Apply strict uRPF to the cloud-hosts vlan [homer/public] - 10https://gerrit.wikimedia.org/r/774478 (https://phabricator.wikimedia.org/T285461) (owner: 10Ayounsi) [13:56:10] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/filebackend.php: Config: [[gerrit:773608|Migrate $wmfAllServices to $wmgAllServices (T45956)]] (3/5) (duration: 00m 51s) [13:56:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:22] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:57:25] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:26] (03CR) 10Ayounsi: Added optional ability to enable uRPF filtering on arbitary CR ints (032 comments) [homer/public] - 10https://gerrit.wikimedia.org/r/702446 (https://phabricator.wikimedia.org/T285461) (owner: 10Cathal Mooney) [13:57:35] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/CirrusSearch-labs.php: Config: [[gerrit:773608|Migrate $wmfAllServices to $wmgAllServices (T45956)]] (4/5, prod noop) (duration: 01m 07s) [13:57:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:53] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/CommonSettings-labs.php: Config: [[gerrit:773608|Migrate $wmfAllServices to $wmgAllServices (T45956)]] (5/5, prod noop) (duration: 01m 04s) [13:58:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:59] T45956: Rename $wmf* to $wmg* in wmf-config - https://phabricator.wikimedia.org/T45956 [13:59:01] taavi: all done [13:59:09] ok, thanks! 
[13:59:11] jouncebot: nowandnext [13:59:11] For the next 0 hour(s) and 0 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220328T1300) [13:59:11] In 1 hour(s) and 30 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220328T1530) [13:59:11] (03CR) 10Vivian Rook: [C: 03+2] Update codfw1dev cloudservices openstack [puppet] - 10https://gerrit.wikimedia.org/r/773806 (https://phabricator.wikimedia.org/T304702) (owner: 10Vivian Rook) [13:59:21] heh, perfect timing [13:59:37] there doesn't seem to be anything happening directly afterwards, so I'll sneak in a quick security patch [13:59:46] (03PS1) 10Ottomata: Add thirdparty/conda component to reprepro updates [puppet] - 10https://gerrit.wikimedia.org/r/774481 (https://phabricator.wikimedia.org/T304450) [14:00:00] ok [14:00:01] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [14:00:02] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [14:00:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148', diff saved to https://phabricator.wikimedia.org/P23405 and previous config saved to /var/cache/conftool/dbconfig/20220328-140039-marostegui.json [14:00:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:54] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [14:00:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:51] (03PS2) 10David Caro: openstack.cinder: Add patch to the backups chunkeddriver [puppet] - 10https://gerrit.wikimedia.org/r/774454 (https://phabricator.wikimedia.org/T302855) [14:01:53] (03CR) 10David Caro: openstack.cinder: 
Add patch to the backups chunkeddriver (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/774454 (https://phabricator.wikimedia.org/T302855) (owner: 10David Caro) [14:04:28] (03CR) 10Elukey: [C: 03+2] ml-services: update draft/article quality docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/772811 (https://phabricator.wikimedia.org/T300270) (owner: 10AikoChou) [14:06:28] !log deploy security patch for T226212 [14:06:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:32] * taavi done [14:06:45] (03PS3) 10Alexandros Kosiaris: decommission kubernetes[12]00[1-4] [puppet] - 10https://gerrit.wikimedia.org/r/771850 (https://phabricator.wikimedia.org/T303044) [14:06:53] (03CR) 10Ssingh: [C: 03+1] "Thanks for the patch!" [puppet] - 10https://gerrit.wikimedia.org/r/773780 (owner: 10Muehlenhoff) [14:06:55] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [14:07:08] !log pool cp2029 with HAProxy as TLS termination layer - T290005 [14:07:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:13] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [14:10:25] (03PS41) 10Btullis: Add helm charts and a helmfile configuration for datahub [deployment-charts] - 10https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454) [14:11:01] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [14:11:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:46] (03CR) 10Giuseppe Lavagetto: [C: 03+2] conftool: remove request* objects from the schema [puppet] - 10https://gerrit.wikimedia.org/r/774421 
(https://phabricator.wikimedia.org/T302471) (owner: 10Giuseppe Lavagetto) [14:11:51] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [14:11:53] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [14:11:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:57] (03PS2) 10Giuseppe Lavagetto: conftool: remove request* objects from the schema [puppet] - 10https://gerrit.wikimedia.org/r/774421 (https://phabricator.wikimedia.org/T302471) [14:11:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:22] !log decommission kubernetes100[1-4]. T303044 [14:12:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:26] T303044: decommission kubernetes100[1-4] - https://phabricator.wikimedia.org/T303044 [14:12:45] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [14:12:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:50] (03CR) 10Alexandros Kosiaris: [C: 03+2] decommission kubernetes[12]00[1-4] [puppet] - 10https://gerrit.wikimedia.org/r/771850 (https://phabricator.wikimedia.org/T303044) (owner: 10Alexandros Kosiaris) [14:13:27] !log akosiaris@cumin1001 START - Cookbook sre.hosts.decommission for hosts kubernetes[1001-1004].eqiad.wmnet [14:13:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:09] !log akosiaris@cumin1001 START - Cookbook sre.hosts.decommission for hosts kubernetes[2001-2004].codfw.wmnet [14:15:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148 (T298556)', diff saved to https://phabricator.wikimedia.org/P23406 and previous config saved to /var/cache/conftool/dbconfig/20220328-141544-marostegui.json [14:15:46] !log marostegui@cumin1001 START - Cookbook 
sre.hosts.downtime for 8:00:00 on db1149.eqiad.wmnet with reason: Maintenance [14:15:48] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1149.eqiad.wmnet with reason: Maintenance [14:15:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:50] T298556: Fix mismatching field type of oldimage.oi_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298556 [14:15:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1149 (T298556)', diff saved to https://phabricator.wikimedia.org/P23407 and previous config saved to /var/cache/conftool/dbconfig/20220328-141552-marostegui.json [14:15:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:09] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] "Looks good to me, there don’t seem to be any other references to $wmfAllServices left:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/773609 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [14:17:20] (03CR) 10Giuseppe Lavagetto: [C: 03+2] conftool: remove request* objects from sync [puppet] - 10https://gerrit.wikimedia.org/r/774422 (https://phabricator.wikimedia.org/T302471) (owner: 10Giuseppe Lavagetto) [14:18:38] (03CR) 10Jcrespo: [C: 03+2] mariabackup: Prefer db2099 mw db (backup source) for mediabackups [puppet] - 10https://gerrit.wikimedia.org/r/774477 (https://phabricator.wikimedia.org/T299764) (owner: 10Jcrespo) [14:19:00] (JobUnavailable) firing: Reduced availability for job trafficserver in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:19:07] !log akosiaris@cumin1001 START - Cookbook sre.dns.netbox 
[14:19:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:15] (03PS2) 10Giuseppe Lavagetto: conftool: remove request* objects from sync [puppet] - 10https://gerrit.wikimedia.org/r/774422 (https://phabricator.wikimedia.org/T302471) [14:20:29] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_moni [14:20:29] 3BGP_status [14:20:42] !log aikochou@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [14:20:46] !log akosiaris@cumin1001 START - Cookbook sre.dns.netbox [14:20:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:58] (KubernetesCalicoDown) firing: (2) kubernetes1001.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:21:05] RECOVERY - Check systemd state on deneb is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:21:18] (03CR) 10Filippo Giunchedi: [C: 03+1] swift: deploy swift_ring_manager to one node per cluster [puppet] - 10https://gerrit.wikimedia.org/r/769941 (https://phabricator.wikimedia.org/T265117) (owner: 10MVernon) [14:21:56] (03CR) 10David Caro: openstack.cinder: Add patch to the backups chunkeddriver (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/774454 (https://phabricator.wikimedia.org/T302855) (owner: 10David Caro) [14:22:19] (03CR) 10Filippo 
Giunchedi: [C: 03+1] "LGTM, modulo the variables duplication for testing" [puppet] - 10https://gerrit.wikimedia.org/r/769942 (https://phabricator.wikimedia.org/T265117) (owner: 10MVernon) [14:24:46] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [alerts] - 10https://gerrit.wikimedia.org/r/773801 (https://phabricator.wikimedia.org/T300246) (owner: 10Btullis) [14:25:01] (03CR) 10Andrew Bogott: [C: 04-1] "looks good, one issue inline" [puppet] - 10https://gerrit.wikimedia.org/r/774454 (https://phabricator.wikimedia.org/T302855) (owner: 10David Caro) [14:25:58] (KubernetesCalicoDown) firing: (8) kubernetes1001.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:26:30] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitor [14:26:30] P_status [14:26:43] (03PS1) 10Kormat: orchestrator: Switch to db1115 as backend. [puppet] - 10https://gerrit.wikimedia.org/r/774485 (https://phabricator.wikimedia.org/T301315) [14:27:01] (03PS3) 10David Caro: openstack.cinder: Add patch to the backups chunkeddriver [puppet] - 10https://gerrit.wikimedia.org/r/774454 (https://phabricator.wikimedia.org/T302855) [14:27:15] (03CR) 10jerkins-bot: [V: 04-1] orchestrator: Switch to db1115 as backend. 
[puppet] - 10https://gerrit.wikimedia.org/r/774485 (https://phabricator.wikimedia.org/T301315) (owner: 10Kormat) [14:27:51] (03PS4) 10David Caro: openstack.cinder: Add patch to the backups chunkeddriver [puppet] - 10https://gerrit.wikimedia.org/r/774454 (https://phabricator.wikimedia.org/T302855) [14:28:00] (03CR) 10David Caro: openstack.cinder: Add patch to the backups chunkeddriver (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/774454 (https://phabricator.wikimedia.org/T302855) (owner: 10David Caro) [14:28:11] !log aikochou@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [14:28:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:01] (03CR) 10Kormat: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34587/console" [puppet] - 10https://gerrit.wikimedia.org/r/774485 (https://phabricator.wikimedia.org/T301315) (owner: 10Kormat) [14:29:03] (03CR) 10Andrew Bogott: [C: 03+1] "gj hunting down this upstream bug!" [puppet] - 10https://gerrit.wikimedia.org/r/774454 (https://phabricator.wikimedia.org/T302855) (owner: 10David Caro) [14:29:36] (03PS2) 10Kormat: orchestrator: Switch to db1115 as backend. 
[puppet] - 10https://gerrit.wikimedia.org/r/774485 (https://phabricator.wikimedia.org/T301315) [14:30:34] (03CR) 10Kormat: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34589/console" [puppet] - 10https://gerrit.wikimedia.org/r/774485 (https://phabricator.wikimedia.org/T301315) (owner: 10Kormat) [14:30:36] (03PS2) 10Ottomata: Add thirdparty/conda component to reprepro updates [puppet] - 10https://gerrit.wikimedia.org/r/774481 (https://phabricator.wikimedia.org/T304450) [14:30:40] (03Abandoned) 10Andrew Bogott: nrpe_local.cfg.erb: increase nrpe timeout to 5 minutes [puppet] - 10https://gerrit.wikimedia.org/r/764464 (owner: 10Andrew Bogott) [14:30:58] (KubernetesCalicoDown) firing: (8) kubernetes1001.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:32:00] (03CR) 10MVernon: [C: 03+2] swift: deploy swift_ring_manager to one node per cluster [puppet] - 10https://gerrit.wikimedia.org/r/769941 (https://phabricator.wikimedia.org/T265117) (owner: 10MVernon) [14:32:14] (03CR) 10David Caro: [C: 03+2] openstack.cinder: Add patch to the backups chunkeddriver [puppet] - 10https://gerrit.wikimedia.org/r/774454 (https://phabricator.wikimedia.org/T302855) (owner: 10David Caro) [14:33:01] (03CR) 10Herron: admin: add tsepothoabala to deployment (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/772823 (https://phabricator.wikimedia.org/T303398) (owner: 10Jbond) [14:33:24] (03PS1) 10Klausman: hiera: Add ML staging k8s role [puppet] - 10https://gerrit.wikimedia.org/r/774488 (https://phabricator.wikimedia.org/T302195) [14:34:18] (03CR) 10Klausman: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34590/console" [puppet] - 10https://gerrit.wikimedia.org/r/774488 (https://phabricator.wikimedia.org/T302195) (owner: 10Klausman) [14:35:50] 
!log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149 (T298556)', diff saved to https://phabricator.wikimedia.org/P23408 and previous config saved to /var/cache/conftool/dbconfig/20220328-143550-marostegui.json [14:35:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:55] T298556: Fix mismatching field type of oldimage.oi_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298556 [14:35:58] (KubernetesCalicoDown) firing: (8) kubernetes1001.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:36:12] 10SRE, 10Thumbor, 10Traffic, 10affects-Kiwix-and-openZIM: Unjustified HTTP 429 responses lead to "endless" Wikipedia scrapes - https://phabricator.wikimedia.org/T304814 (10AntiCompositeNumber) 429 is returned when the thumbnail hits one of four ratelimits (see https://wikitech.wikimedia.org/wiki/Thumbor#Th... 
[14:36:33] (03PS2) 10Klausman: hiera: Add ML staging k8s role [puppet] - 10https://gerrit.wikimedia.org/r/774488 (https://phabricator.wikimedia.org/T302195) [14:37:14] (03CR) 10Klausman: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34591/console" [puppet] - 10https://gerrit.wikimedia.org/r/774488 (https://phabricator.wikimedia.org/T302195) (owner: 10Klausman) [14:37:20] 10SRE, 10Thumbor, 10Traffic, 10affects-Kiwix-and-openZIM: MWoffliner scrapes slowed down by Thumbor failure throttling 429s - https://phabricator.wikimedia.org/T304814 (10AntiCompositeNumber) [14:38:08] (03PS3) 10Klausman: hiera: Add ML staging k8s role [puppet] - 10https://gerrit.wikimedia.org/r/774488 (https://phabricator.wikimedia.org/T302195) [14:39:08] (03PS5) 10Filippo Giunchedi: Alertmanager: route DCops task alerts to sites project [puppet] - 10https://gerrit.wikimedia.org/r/774437 (https://phabricator.wikimedia.org/T300836) (owner: 10Ayounsi) [14:40:58] (KubernetesCalicoDown) firing: (8) kubernetes1001.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:41:57] 10SRE, 10SRE-Access-Requests: Requesting access to releaser for MarkAHershberger - https://phabricator.wikimedia.org/T302287 (10herron) [14:42:13] (03PS1) 10MVernon: swift::ring_manager typo in #! [puppet] - 10https://gerrit.wikimedia.org/r/774494 (https://phabricator.wikimedia.org/T265117) [14:43:29] (03CR) 10MVernon: "Hi," [puppet] - 10https://gerrit.wikimedia.org/r/774494 (https://phabricator.wikimedia.org/T265117) (owner: 10MVernon) [14:43:40] 10SRE, 10SRE-Access-Requests: Requesting access to releaser for MarkAHershberger - https://phabricator.wikimedia.org/T302287 (10herron) 05In progress→03Resolved Resolving as the near-term access requested in the description has been provisioned, please reopen if any follow up is needed. Thanks! 
[14:45:48] !log bking@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=wdqs*,name=eqiad [14:45:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:53] (03PS6) 10Filippo Giunchedi: Alertmanager: route DCops task alerts to sites project [puppet] - 10https://gerrit.wikimedia.org/r/774437 (https://phabricator.wikimedia.org/T300836) (owner: 10Ayounsi) [14:45:58] (KubernetesCalicoDown) resolved: (8) kubernetes1001.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:46:08] !log bking@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=wdqs-internal,name=eqiad [14:46:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:59] !log 'bking@cumin1001 repooling wdqs services in IAD ref T302494' [14:47:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:03] T302494: The WDQS Streaming Updater should use S3 to access thanos-swift instead of the native swift protocol - https://phabricator.wikimedia.org/T302494 [14:47:04] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM! 
easy enough 😊" [puppet] - 10https://gerrit.wikimedia.org/r/774494 (https://phabricator.wikimedia.org/T265117) (owner: 10MVernon) [14:47:36] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/774481 (https://phabricator.wikimedia.org/T304450) (owner: 10Ottomata) [14:47:58] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:48:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:06] !log akosiaris@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [14:48:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149', diff saved to https://phabricator.wikimedia.org/P23409 and previous config saved to /var/cache/conftool/dbconfig/20220328-145055-marostegui.json [14:50:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:49] (03CR) 10Papaul: [C: 03+1] labs-in filter: remove PXE term [homer/public] - 10https://gerrit.wikimedia.org/r/769657 (owner: 10Ayounsi) [14:53:33] (03PS1) 10Muehlenhoff: Add data.yaml entry for hartman [puppet] - 10https://gerrit.wikimedia.org/r/774495 (https://phabricator.wikimedia.org/T304120) [14:54:06] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM https://puppet-compiler.wmflabs.org/pcc-worker1002/34593/" [puppet] - 10https://gerrit.wikimedia.org/r/774437 (https://phabricator.wikimedia.org/T300836) (owner: 10Ayounsi) [14:54:36] 10SRE, 10Thumbor, 10Traffic, 10affects-Kiwix-and-openZIM: MWoffliner scrapes slowed down by Thumbor failure throttling 429s - https://phabricator.wikimedia.org/T304814 (10AntiCompositeNumber) The actual failure for this thumbnail is ` ImageMagickException: Failed to convert image convert: IDAT: invalid di... 
[14:57:22] (03CR) 10Muehlenhoff: [C: 03+2] Add data.yaml entry for hartman [puppet] - 10https://gerrit.wikimedia.org/r/774495 (https://phabricator.wikimedia.org/T304120) (owner: 10Muehlenhoff) [14:57:24] PROBLEM - Host mr1-ulsfo.oob is DOWN: PING CRITICAL - Packet loss = 100% [15:02:57] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host restbase2027.codfw.wmnet with OS buster [15:03:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:06] 10SRE, 10ops-codfw, 10DC-Ops, 10RESTBase: Q3:(Need By: TBD) rack/setup/install restbase2027 - https://phabricator.wikimedia.org/T301399 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host restbase2027.codfw.wmnet with OS buster [15:03:30] 10SRE, 10Thumbor, 10Traffic, 10affects-Kiwix-and-openZIM: MWoffliner scrapes slowed down by Thumbor failure throttling 429s - https://phabricator.wikimedia.org/T304814 (10Kelson) >>! In T304814#7810926, @AntiCompositeNumber wrote: > 429 is returned when the thumbnail hits one of four ratelimits (see https:... 
[15:03:38] RECOVERY - Host mr1-ulsfo.oob is UP: PING OK - Packet loss = 0%, RTA = 74.25 ms [15:05:34] PROBLEM - Check correctness of the icinga configuration on alert1001 is CRITICAL: Icinga configuration contains errors https://wikitech.wikimedia.org/wiki/Icinga [15:06:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149', diff saved to https://phabricator.wikimedia.org/P23410 and previous config saved to /var/cache/conftool/dbconfig/20220328-150600-marostegui.json [15:06:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:13] (KubernetesRsyslogDown) firing: (2) rsyslog on ml-serve1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [15:06:26] PROBLEM - Widespread puppet agent failures- no resources reported on alert1001 is CRITICAL: 0.01067 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [15:06:58] Emperor: FYI Icinga config is broken due to [15:07:00] Error: Contact group 'data-persistence' specified in service 'Check unit status of swift_ring_manager' for host 'ms-fe1009' [15:07:09] is not defined anywhere! [15:07:27] * volans omitted some intermediate part of the message [15:09:48] (03PS10) 10BBlack: Request high-entropy Sec-CH-UA* client hints [puppet] - 10https://gerrit.wikimedia.org/r/765485 (https://phabricator.wikimedia.org/T301238) (owner: 10Phuedx) [15:11:24] sadness [15:11:39] (03CR) 10MVernon: [C: 03+2] swift::ring_manager typo in #! 
[puppet] - 10https://gerrit.wikimedia.org/r/774494 (https://phabricator.wikimedia.org/T265117) (owner: 10MVernon) [15:11:51] !log imported libapache2-mod-auth-cas 1.2-1+wmf10u2 to apt.wikimedia.org/buster-wikimedia [15:11:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:10] (03CR) 10Ottomata: [C: 03+2] Add thirdparty/conda component to reprepro updates [puppet] - 10https://gerrit.wikimedia.org/r/774481 (https://phabricator.wikimedia.org/T304450) (owner: 10Ottomata) [15:12:13] (03CR) 10BBlack: [C: 03+2] Request high-entropy Sec-CH-UA* client hints [puppet] - 10https://gerrit.wikimedia.org/r/765485 (https://phabricator.wikimedia.org/T301238) (owner: 10Phuedx) [15:13:20] 10SRE, 10ops-codfw, 10DC-Ops, 10RESTBase: Q3:(Need By: TBD) rack/setup/install restbase2027 - https://phabricator.wikimedia.org/T301399 (10Papaul) a:05Papaul→03hnowlan @hnowlan can you please check the partman recipe when done assign the task back to me. Thanks [15:13:23] 10SRE, 10ops-codfw, 10DC-Ops, 10RESTBase: Q3:(Need By: TBD) rack/setup/install restbase2027 - https://phabricator.wikimedia.org/T301399 (10Papaul) [15:15:04] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts kubernetes[2001-2004].codfw.wmnet [15:15:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:43] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts kubernetes[1001-1004].eqiad.wmnet [15:15:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:46] (03CR) 10Filippo Giunchedi: [C: 03+2] Alertmanager: route DCops task alerts to sites project [puppet] - 10https://gerrit.wikimedia.org/r/774437 (https://phabricator.wikimedia.org/T300836) (owner: 10Ayounsi) [15:20:51] (03PS1) 10Zabe: Start writing to $wmgLocalServices the same value as to $wmfLocalServices [mediawiki-config] - 10https://gerrit.wikimedia.org/r/774497 
(https://phabricator.wikimedia.org/T45956) [15:21:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149 (T298556)', diff saved to https://phabricator.wikimedia.org/P23411 and previous config saved to /var/cache/conftool/dbconfig/20220328-152105-marostegui.json [15:21:07] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1160.eqiad.wmnet with reason: Maintenance [15:21:09] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1160.eqiad.wmnet with reason: Maintenance [15:21:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:11] T298556: Fix mismatching field type of oldimage.oi_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298556 [15:21:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1160 (T298556)', diff saved to https://phabricator.wikimedia.org/P23412 and previous config saved to /var/cache/conftool/dbconfig/20220328-152114-marostegui.json [15:21:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:24] (03PS1) 10MVernon: swift::ring_manager update contact group [puppet] - 10https://gerrit.wikimedia.org/r/774498 (https://phabricator.wikimedia.org/T265117) [15:21:44] volans: https://gerrit.wikimedia.org/r/c/operations/puppet/+/774498 I think? 
[15:23:16] ops-eqiad: Port with no description on access switch - https://phabricator.wikimedia.org/T304849 (phaultfinder) [15:23:26] !log imported libapache2-mod-auth-cas 1.2-1+wmf11u2 to apt.wikimedia.org/bullseye-wikimedia [15:23:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:39] (PS2) MVernon: swift::ring_manager update contact group [puppet] - https://gerrit.wikimedia.org/r/774498 (https://phabricator.wikimedia.org/T265117) [15:24:02] SRE, SRE-Access-Requests: Requesting access to RESOURCE for USER[S] - https://phabricator.wikimedia.org/T304502 (herron) Stalled→Invalid Hello, I'll close this as invalid for now since the task will need to specify what access/group is being requested, and an approving party, in order to move for... [15:24:44] ops-eqiad, decommission-hardware: decommission kubernetes100[1-4] - https://phabricator.wikimedia.org/T303044 (akosiaris) [15:24:45] Emperor: checking [15:25:18] ops-codfw, decommission-hardware: decommission kubernetes200[1-4] - https://phabricator.wikimedia.org/T303045 (akosiaris) [15:26:06] (PS1) Zabe: Migrate $wmfLocalServices to $wmgLocalServices [mediawiki-config] - https://gerrit.wikimedia.org/r/774499 (https://phabricator.wikimedia.org/T45956) [15:26:08] (PS1) Zabe: Stop writing to $wmfLocalServices [mediawiki-config] - https://gerrit.wikimedia.org/r/774500 (https://phabricator.wikimedia.org/T45956) [15:29:45] (PS2) Zabe: Migrate $wmfLocalServices to $wmgLocalServices [mediawiki-config] - https://gerrit.wikimedia.org/r/774499 (https://phabricator.wikimedia.org/T45956) [15:29:59] (CR) Volans: [C: +1] "LGTM to get this alert on the -databases IRC channel instead of the -operations one." [puppet] - https://gerrit.wikimedia.org/r/774498 (https://phabricator.wikimedia.org/T265117) (owner: MVernon) [15:30:05] jan_drewniak: #bothumor Q:Why did functions stop calling each other? A:They had arguments. 
Rise for Wikimedia Portals Update . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220328T1530). [15:30:08] SRE, ops-eqiad: Eqiad: asw2-a-eqiad:xe-2/0/40 interface up with no description - https://phabricator.wikimedia.org/T304807 (ayounsi) Alright, we fixed the above limitation. See T304849. [15:30:31] (CR) MVernon: [C: +2] swift::ring_manager update contact group [puppet] - https://gerrit.wikimedia.org/r/774498 (https://phabricator.wikimedia.org/T265117) (owner: MVernon) [15:30:49] ops-eqiad: Port with no description on access switch - https://phabricator.wikimedia.org/T304849 (ayounsi) [15:31:14] SRE, ops-eqiad: Eqiad: asw2-a-eqiad:xe-2/0/40 interface up with no description - https://phabricator.wikimedia.org/T304807 (ayounsi) [15:31:18] ops-eqiad: Port with no description on access switch - https://phabricator.wikimedia.org/T304849 (ayounsi) [15:32:22] ops-eqiad: Port with no description on access switch - https://phabricator.wikimedia.org/T304849 (ayounsi) [15:32:26] ops-eqiad: Port with no description on access switch - https://phabricator.wikimedia.org/T304849 (ayounsi) [15:33:25] (PS1) Jdrewniak: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/774501 (https://phabricator.wikimedia.org/T128546) [15:34:43] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host restbase2027.codfw.wmnet with OS buster [15:34:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:47] SRE, ops-codfw, DC-Ops, RESTBase: Q3:(Need By: TBD) rack/setup/install restbase2027 - https://phabricator.wikimedia.org/T301399 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host restbase2027.codfw.wmnet with OS buster executed with errors: - restbase20... 
[15:34:54] Emperor: if you have merged the fix I'll run puppet on alert1001 [15:35:15] (CR) Jdrewniak: [C: +2] Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/774501 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak) [15:35:31] (PS1) Filippo Giunchedi: alertmanager: add eqsin to sites for dcops tasks [puppet] - https://gerrit.wikimedia.org/r/774502 [15:35:36] XioNoX: ^ [15:36:00] volans: fix merged yes, thanks :) [15:36:14] ack, running puppet [15:36:23] (Merged) jenkins-bot: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/774501 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak) [15:36:23] godog: ? (I have some bots /ignored) [15:37:00] XioNoX: lol, https://gerrit.wikimedia.org/r/c/operations/puppet/+/774502 [15:37:12] that was for wikibugs yeah [15:37:24] (CR) Ayounsi: [C: +1] alertmanager: add eqsin to sites for dcops tasks [puppet] - https://gerrit.wikimedia.org/r/774502 (owner: Filippo Giunchedi) [15:37:30] :) [15:37:30] (CR) Filippo Giunchedi: [C: +2] alertmanager: add eqsin to sites for dcops tasks [puppet] - https://gerrit.wikimedia.org/r/774502 (owner: Filippo Giunchedi) [15:37:48] dunno how people stand this channel with all the bot traffic [15:37:55] SRE, Infrastructure-Foundations, serviceops, Release-Engineering-Team (🚂🧪 Trainsperiment Week): Need a service account on deploy servers - https://phabricator.wikimedia.org/T303857 (herron) Removing from the sre access request queue while the details of the request are being clarified. Please r... 
[15:38:17] I make bot traffic show up as NOTICE (as it should be) [15:38:26] (PS1) Bartosz Dziewoński: Disable backtick sequence in ve-mw while conflict with Catalan is investigated [extensions/VisualEditor] (wmf/1.39.0-wmf.4) - https://gerrit.wikimedia.org/r/774400 (https://phabricator.wikimedia.org/T304804) [15:38:33] !log jdrewniak@deploy1002 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:774501| Bumping portals to master (T128546)]] (duration: 00m 54s) [15:38:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:38] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [15:39:05] my client marks bot traffic in grey, and doesn't mark the channel as having content in on bot traffic [15:39:34] !log jdrewniak@deploy1002 Synchronized portals: Wikimedia Portals Update: [[gerrit:774501| Bumping portals to master (T128546)]] (duration: 01m 00s) [15:39:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:49] (by creative abuse of erc-fools and erc-track-faces-priority-list) [15:41:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1160 (T298556)', diff saved to https://phabricator.wikimedia.org/P23413 and previous config saved to /var/cache/conftool/dbconfig/20220328-154117-marostegui.json [15:41:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:23] T298556: Fix mismatching field type of oldimage.oi_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298556 [15:41:38] (CR) Volans: "post-merge -1, this change uses the same description for 2 checks on the same host on Icinga, that is considered duplicated and only the f" [puppet] - https://gerrit.wikimedia.org/r/773571 (https://phabricator.wikimedia.org/T304321) (owner: Jbond) [15:43:32] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START 
helmfile.d/services/mwdebug: apply [15:43:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:33] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [15:44:34] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [15:44:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:26] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [15:45:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:41] (CR) Volans: "post-merge -1, this change uses the same description for 2 checks on the same host on Icinga, that is considered duplicated and only the f" [puppet] - https://gerrit.wikimedia.org/r/773257 (https://phabricator.wikimedia.org/T304321) (owner: Jbond) [15:48:09] (PS1) Volans: icinga: fix https expiry check [puppet] - https://gerrit.wikimedia.org/r/774506 [15:48:40] godog: if you have a sec ^^^ to remove duplicates in icinga config [15:50:50] !log dcausse@deploy1002 Started deploy [wikimedia/discovery/analytics@b5b63c3]: (no justification provided) [15:50:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:59] !log dcausse@deploy1002 Finished deploy [wikimedia/discovery/analytics@b5b63c3]: (no justification provided) (duration: 02m 09s) [15:53:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:05] volans: oof, checking [15:53:19] (CR) Filippo Giunchedi: [C: +1] icinga: fix https expiry check [puppet] - https://gerrit.wikimedia.org/r/774506 (owner: Volans) [15:53:23] volans: LGTM, thank you [15:53:26] thx [15:53:33] (CR) Volans: [C: +2] icinga: fix https expiry check [puppet] - https://gerrit.wikimedia.org/r/774506 (owner: Volans) [15:53:41] !log marostegui@cumin1001 dbctl commit 
(dc=all): 'Repooling after maintenance db1169 (T300775)', diff saved to https://phabricator.wikimedia.org/P23414 and previous config saved to /var/cache/conftool/dbconfig/20220328-155340-marostegui.json [15:53:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:47] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775 [15:55:06] (PS1) Ottomata: reprepro updates - set Name: thirdparty/conda [puppet] - https://gerrit.wikimedia.org/r/774508 (https://phabricator.wikimedia.org/T304450) [15:55:25] (PS1) Volans: swift::ring_manager: fix Icinga contactgroup name [puppet] - https://gerrit.wikimedia.org/r/774509 (https://phabricator.wikimedia.org/T265117) [15:55:51] (CR) Ottomata: [C: +2] reprepro updates - set Name: thirdparty/conda [puppet] - https://gerrit.wikimedia.org/r/774508 (https://phabricator.wikimedia.org/T304450) (owner: Ottomata) [15:56:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P23415 and previous config saved to /var/cache/conftool/dbconfig/20220328-155622-marostegui.json [15:56:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:29] (CR) MVernon: [C: +1] "Thanks! Apologies for extra bonus faff." 
[puppet] - https://gerrit.wikimedia.org/r/774509 (https://phabricator.wikimedia.org/T265117) (owner: Volans) [15:57:38] (CR) Elukey: [C: +1] Allow to specify additional gateway hosts without overriding the default [deployment-charts] - https://gerrit.wikimedia.org/r/773805 (https://phabricator.wikimedia.org/T290966) (owner: JMeybohm) [15:58:23] (CR) Volans: [C: +2] swift::ring_manager: fix Icinga contactgroup name [puppet] - https://gerrit.wikimedia.org/r/774509 (https://phabricator.wikimedia.org/T265117) (owner: Volans) [16:04:02] (CR) Elukey: "Left a couple of comments but the settings look good afaics, I am only wondering if modules/role/manifests/ml_k8s/master/staging.pp should" [puppet] - https://gerrit.wikimedia.org/r/774488 (https://phabricator.wikimedia.org/T302195) (owner: Klausman) [16:05:54] (CR) JMeybohm: [C: +2] Allow to specify additional gateway hosts without overriding the default [deployment-charts] - https://gerrit.wikimedia.org/r/773805 (https://phabricator.wikimedia.org/T290966) (owner: JMeybohm) [16:08:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P23416 and previous config saved to /var/cache/conftool/dbconfig/20220328-160845-marostegui.json [16:08:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:25] (CR) Volans: "post-merge -1, this one too generated a duplicated description, patch incoming." 
[puppet] - https://gerrit.wikimedia.org/r/773249 (https://phabricator.wikimedia.org/T304321) (owner: Jbond) [16:10:29] (Merged) jenkins-bot: Allow to specify additional gateway hosts without overriding the default [deployment-charts] - https://gerrit.wikimedia.org/r/773805 (https://phabricator.wikimedia.org/T290966) (owner: JMeybohm) [16:11:09] RECOVERY - Check correctness of the icinga configuration on alert1001 is OK: Icinga configuration is correct https://wikitech.wikimedia.org/wiki/Icinga [16:11:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P23417 and previous config saved to /var/cache/conftool/dbconfig/20220328-161128-marostegui.json [16:11:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:52] (PS4) Klausman: hiera: Add ML staging k8s role [puppet] - https://gerrit.wikimedia.org/r/774488 (https://phabricator.wikimedia.org/T302195) [16:13:48] !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/miscweb: apply [16:13:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:08] (PS1) Volans: icinga: fix duplicate description for ChartMuseum [puppet] - https://gerrit.wikimedia.org/r/774510 [16:14:11] (PS5) Klausman: hiera: Add ML staging k8s role [puppet] - https://gerrit.wikimedia.org/r/774488 (https://phabricator.wikimedia.org/T302195) [16:14:26] (CR) Klausman: hiera: Add ML staging k8s role (3 comments) [puppet] - https://gerrit.wikimedia.org/r/774488 (https://phabricator.wikimedia.org/T302195) (owner: Klausman) [16:14:36] !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [16:14:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:04] (CR) Klausman: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34594/console" [puppet] - 
https://gerrit.wikimedia.org/r/774488 (https://phabricator.wikimedia.org/T302195) (owner: Klausman) [16:15:43] RECOVERY - Check systemd state on snapshot1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:16:46] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/services/miscweb: apply [16:16:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:59] SRE, SRE-swift-storage: File not found: /v1/AUTH_mw/wikipedia-commons-local-public.9e/9/9e/Christopher_Wilbrand.jpg - https://phabricator.wikimedia.org/T304788 (herron) p:Triage→Medium [16:18:41] SRE, ChangeProp, envoy, serviceops, Sustainability (Incident Followup): Investigate shorter-lived persistent connections for Envoy - https://phabricator.wikimedia.org/T304799 (herron) p:Triage→Medium [16:19:33] (CR) Elukey: hiera: Add ML staging k8s role (1 comment) [puppet] - https://gerrit.wikimedia.org/r/774488 (https://phabricator.wikimedia.org/T302195) (owner: Klausman) [16:19:40] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [16:19:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:36] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/services/miscweb: apply [16:20:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:28] SRE, serviceops, Sustainability (Incident Followup): Set API server weights - https://phabricator.wikimedia.org/T304800 (herron) p:Triage→Medium [16:22:10] (CR) Volans: [C: +2] "self-merging to resolve the duplicate on icinga" [puppet] - https://gerrit.wikimedia.org/r/774510 (owner: Volans) [16:22:16] (CR) Filippo Giunchedi: [C: +1] icinga: fix duplicate description for ChartMuseum [puppet] - https://gerrit.wikimedia.org/r/774510 (owner: Volans) [16:22:27] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox 
[16:22:27] hah! good timing is good [16:22:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:36] godog: telepathy! [16:22:39] (PS6) Klausman: hiera: Add ML staging k8s role [puppet] - https://gerrit.wikimedia.org/r/774488 (https://phabricator.wikimedia.org/T302195) [16:22:49] indeed [16:23:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P23418 and previous config saved to /var/cache/conftool/dbconfig/20220328-162350-marostegui.json [16:23:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:12] (CR) Filippo Giunchedi: [C: +1] grafana: provision JSON datasource (1 comment) [puppet] - https://gerrit.wikimedia.org/r/774380 (https://phabricator.wikimedia.org/T304583) (owner: Phedenskog) [16:24:40] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [16:24:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:08] (PS1) Dave Pifke: coal: Use Python 3, add cachelib dependency [puppet] - https://gerrit.wikimedia.org/r/774512 (https://phabricator.wikimedia.org/T301638) [16:26:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1160 (T298556)', diff saved to https://phabricator.wikimedia.org/P23419 and previous config saved to /var/cache/conftool/dbconfig/20220328-162633-marostegui.json [16:26:35] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [16:26:36] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [16:26:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:37] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1146.eqiad.wmnet with reason: Maintenance [16:26:38] 
T298556: Fix mismatching field type of oldimage.oi_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298556 [16:26:39] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1146.eqiad.wmnet with reason: Maintenance [16:26:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3314 (T298556)', diff saved to https://phabricator.wikimedia.org/P23420 and previous config saved to /var/cache/conftool/dbconfig/20220328-162644-marostegui.json [16:26:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:26] (PS7) Klausman: hiera: Add ML staging k8s role [puppet] - https://gerrit.wikimedia.org/r/774488 (https://phabricator.wikimedia.org/T302195) [16:27:30] and icinga finally happy again, no errors, no warnings, no duplicates [16:27:33] SRE, Thumbor, Traffic, affects-Kiwix-and-openZIM: MWoffliner scrapes slowed down by Thumbor failure throttling 429s - https://phabricator.wikimedia.org/T304814 (herron) p:Triage→Medium [16:28:32] PROBLEM - Check systemd state on deneb is CRITICAL: CRITICAL - degraded: The following units failed: docker-reporter-releng-images.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:29:08] (CR) Klausman: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34595/console" [puppet] - https://gerrit.wikimedia.org/r/774488 (https://phabricator.wikimedia.org/T302195) (owner: Klausman) [16:29:15] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:29:18] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:15] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host ml-cache1002.eqiad.wmnet with OS bullseye [16:31:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:20] SRE, ops-eqiad, DC-Ops, Machine-Learning-Team: Q3:(Need By: TBD) rack/setup/install ml-cache100[1-3] - https://phabricator.wikimedia.org/T299435 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ml-cache1002.eqiad.wmnet with OS bullseye [16:32:09] ops-eqiad: thanos-be1003 sdm disk failed - https://phabricator.wikimedia.org/T304868 (fgiunchedi) [16:32:39] (CR) Klausman: [V: +1] hiera: Add ML staging k8s role (1 comment) [puppet] - https://gerrit.wikimedia.org/r/774488 (https://phabricator.wikimedia.org/T302195) (owner: Klausman) [16:38:40] PROBLEM - MegaRAID on thanos-be1003 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [16:38:41] ACKNOWLEDGEMENT - MegaRAID on thanos-be1003 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T304873 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [16:38:45] SRE, ops-eqiad: Degraded RAID on thanos-be1003 - https://phabricator.wikimedia.org/T304873 (ops-monitoring-bot) [16:38:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T300775)', diff saved to https://phabricator.wikimedia.org/P23421 and previous config saved to /var/cache/conftool/dbconfig/20220328-163855-marostegui.json [16:38:57] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1164.eqiad.wmnet with reason: Maintenance [16:38:59] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1164.eqiad.wmnet with reason: Maintenance [16:39:01] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:03] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775 [16:39:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1164 (T300775)', diff saved to https://phabricator.wikimedia.org/P23422 and previous config saved to /var/cache/conftool/dbconfig/20220328-163903-marostegui.json [16:39:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:52] SRE, ops-eqiad: Degraded RAID on thanos-be1003 - https://phabricator.wikimedia.org/T304873 (fgiunchedi) [16:39:54] ops-eqiad: thanos-be1003 sdm disk failed - https://phabricator.wikimedia.org/T304868 (fgiunchedi) [16:42:31] (CR) Vivian Rook: [C: +2] paws: add paws prometheus role/profile [puppet] - https://gerrit.wikimedia.org/r/774381 (https://phabricator.wikimedia.org/T304716) (owner: Majavah) [16:44:39] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1142.eqiad.wmnet with OS buster [16:44:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:44] SRE, ops-eqiad, DC-Ops, Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-worker1142.eqiad.wmnet with OS buster [16:48:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314 (T298556)', diff saved to https://phabricator.wikimedia.org/P23423 and previous config saved to /var/cache/conftool/dbconfig/20220328-164825-marostegui.json [16:48:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:31] T298556: Fix mismatching field type of 
oldimage.oi_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298556 [16:50:37] SRE, Wikimedia-Mailing-lists, User-Ladsgroup: Email spam from varying tawk.email addresses - https://phabricator.wikimedia.org/T304390 (Quiddity) Open→Resolved a:Ladsgroup That worked. Thanks! [16:50:48] (PS4) Reedy: Keystone: Update deprecated action=oathvalidate calls [puppet] - https://gerrit.wikimedia.org/r/774401 (https://phabricator.wikimedia.org/T304869) [16:59:27] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ml-cache1002.eqiad.wmnet with OS bullseye [16:59:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:59:32] SRE, ops-eqiad, DC-Ops, Machine-Learning-Team: Q3:(Need By: TBD) rack/setup/install ml-cache100[1-3] - https://phabricator.wikimedia.org/T299435 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ml-cache1002.eqiad.wmnet with OS bullseye executed wit... [17:00:05] ryankemper: #bothumor My software never has bugs. It just develops random features. Rise for Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220328T1700). 
[17:03:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314', diff saved to https://phabricator.wikimedia.org/P23424 and previous config saved to /var/cache/conftool/dbconfig/20220328-170330-marostegui.json [17:03:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:05:45] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1142.eqiad.wmnet with OS buster [17:05:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:05:50] SRE, ops-eqiad, DC-Ops, Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-worker1142.eqiad.wmnet with OS buster exec... [17:06:04] PROBLEM - Check systemd state on thanos-be1003 is CRITICAL: CRITICAL - degraded: The following units failed: swift-drive-audit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:10:55] (PS1) Majavah: P:wmcs::paws::prometheus: fix scrape rules [puppet] - https://gerrit.wikimedia.org/r/774516 [17:11:37] (PS3) Sharvaniharan: Config for new android schemas [mediawiki-config] - https://gerrit.wikimedia.org/r/773896 [17:11:40] (PS1) Ottomata: aptrepo updates - set conda Suite to stable [puppet] - https://gerrit.wikimedia.org/r/774517 (https://phabricator.wikimedia.org/T304450) [17:12:15] SRE, Traffic, envoy, serviceops, Patch-For-Review: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 (JMeybohm) >>! In T300324#7801057, @RLazarus wrote: > Hmm, the 1.21.1 build didn't work out of the box. Running `build-envoy-deb buster future` got me this: > > `... 
[17:17:23] (CR) Ottomata: [C: +2] aptrepo updates - set conda Suite to stable [puppet] - https://gerrit.wikimedia.org/r/774517 (https://phabricator.wikimedia.org/T304450) (owner: Ottomata) [17:18:15] (PS2) Majavah: P:wmcs::paws::prometheus: fix scrape rules [puppet] - https://gerrit.wikimedia.org/r/774516 [17:18:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314', diff saved to https://phabricator.wikimedia.org/P23425 and previous config saved to /var/cache/conftool/dbconfig/20220328-171835-marostegui.json [17:18:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:20:36] (PS1) Dave Pifke: deployment-prep: re-point to new bullseye hosts [puppet] - https://gerrit.wikimedia.org/r/774518 (https://phabricator.wikimedia.org/T301638) [17:24:57] (PS2) Dave Pifke: deployment-prep: re-point to new bullseye hosts [puppet] - https://gerrit.wikimedia.org/r/774518 (https://phabricator.wikimedia.org/T301638) [17:26:59] SRE, ops-eqiad, DC-Ops, Machine-Learning-Team: Q3:(Need By: TBD) rack/setup/install ml-cache100[1-3] - https://phabricator.wikimedia.org/T299435 (Cmjohnson) @elukey Since moving the server, I cannot get it to install the OS correctly, can you please take a look. 
Thanks [17:33:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314 (T298556)', diff saved to https://phabricator.wikimedia.org/P23426 and previous config saved to /var/cache/conftool/dbconfig/20220328-173340-marostegui.json [17:33:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:48] T298556: Fix mismatching field type of oldimage.oi_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298556 [17:45:39] (PS1) Kosta Harlan: GrowthExperiments: Add more expanded topics for GLAM campaign [mediawiki-config] - https://gerrit.wikimedia.org/r/774519 (https://phabricator.wikimedia.org/T301029) [17:48:06] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:48:36] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:53:56] (PS1) Ottomata: aprepro updates - conda - set Components: main>thirdparty/conda [puppet] - https://gerrit.wikimedia.org/r/774522 (https://phabricator.wikimedia.org/T304450) [17:59:37] (PS1) Vivian Rook: Update eqiad1 cloudservices openstack [puppet] - https://gerrit.wikimedia.org/r/774523 (https://phabricator.wikimedia.org/T304880) [18:00:28] (PS2) Vivian Rook: Update eqiad1 cloudservices openstack [puppet] - https://gerrit.wikimedia.org/r/774523 (https://phabricator.wikimedia.org/T304880) [18:00:49] (CR) Muehlenhoff: coal: Use Python 3, add cachelib dependency (1 comment) [puppet] - https://gerrit.wikimedia.org/r/774512 (https://phabricator.wikimedia.org/T301638) (owner: Dave Pifke) [18:05:46] (PS1) Hashar: ci: on castor server drop /srv requirement [puppet] - https://gerrit.wikimedia.org/r/774525 
(https://phabricator.wikimedia.org/T252071) [18:07:27] (CR) Hashar: "Cherry picked on the integration puppet master. Puppet passes on the old instance (integration-castor-03) and pass on the new instance tha" [puppet] - https://gerrit.wikimedia.org/r/774525 (https://phabricator.wikimedia.org/T252071) (owner: Hashar) [18:08:22] (CR) jerkins-bot: [V: -1] ci: on castor server drop /srv requirement [puppet] - https://gerrit.wikimedia.org/r/774525 (https://phabricator.wikimedia.org/T252071) (owner: Hashar) [18:10:21] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [18:13:04] (PS1) JMeybohm: Ensure the data in kubernetes secrets is ordered by key [deployment-charts] - https://gerrit.wikimedia.org/r/774528 [18:13:56] ops-codfw, DC-Ops, cloud-services-team (Hardware): (Need By: TBD) rack/setup/install - https://phabricator.wikimedia.org/T304881 (RobH) [18:14:17] ops-codfw, DC-Ops, cloud-services-team (Hardware): Q3:(Need By: TBD) rack/setup/install 7 wmcs hosts - https://phabricator.wikimedia.org/T304881 (RobH) [18:14:32] ops-codfw, DC-Ops, cloud-services-team (Hardware): Q3:(Need By: TBD) rack/setup/install 7 wmcs hosts - https://phabricator.wikimedia.org/T304881 (RobH) [18:15:32] (CR) jerkins-bot: [V: -1] Ensure the data in kubernetes secrets is ordered by key [deployment-charts] - https://gerrit.wikimedia.org/r/774528 (owner: JMeybohm) [18:20:21] (JobUnavailable) firing: Reduced availability for job trafficserver in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:24:46] (PS2) JMeybohm: Ensure the data in kubernetes secrets 
is ordered by key [deployment-charts] - https://gerrit.wikimedia.org/r/774528 [18:26:56] SRE, ops-eqiad: Degraded RAID on thanos-be1003 - https://phabricator.wikimedia.org/T304873 (herron) p:Triage→High [18:27:52] (PS3) JMeybohm: Ensure the data in kubernetes secrets is ordered by key [deployment-charts] - https://gerrit.wikimedia.org/r/774528 [18:29:22] (PS2) Hashar: ci: on castor server drop /srv requirement [puppet] - https://gerrit.wikimedia.org/r/774525 (https://phabricator.wikimedia.org/T252071) [18:37:34] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.4032 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [18:40:34] (CR) Herron: "+1 on the idea, one comment about the implementation" [puppet] - https://gerrit.wikimedia.org/r/774364 (owner: Filippo Giunchedi) [18:41:07] (CR) Dzahn: gitlab: add version check to restore script (1 comment) [puppet] - https://gerrit.wikimedia.org/r/773783 (https://phabricator.wikimedia.org/T274463) (owner: Jelto) [18:41:53] (CR) Ottomata: [C: +2] aprepro updates - conda - set Components: main>thirdparty/conda [puppet] - https://gerrit.wikimedia.org/r/774522 (https://phabricator.wikimedia.org/T304450) (owner: Ottomata) [18:42:30] PROBLEM - Host phab2001 is DOWN: PING CRITICAL - Packet loss = 100% [18:42:58] (CR) Dzahn: "Yea, this is it. The "gitlab::restore" class does not exist unless "install_restore_script" (or enable_restore) is set and in the profile " [puppet] - https://gerrit.wikimedia.org/r/773783 (https://phabricator.wikimedia.org/T274463) (owner: Jelto) [18:43:48] phab2001 going down = unexpectedly died ?
[18:43:54] seems like it so far [18:44:04] i left a message in -releng [18:44:08] i think they own it [18:44:59] checking mgmt [18:46:14] doesnt work [18:47:00] PROBLEM - PyBal backends health check on lvs2008 is CRITICAL: PYBAL CRITICAL - CRITICAL - git-ssh4_22: Servers phab2001-vcs.codfw.wmnet are marked down but pooled: git-ssh6_22: Servers phab2001-vcs.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [18:47:23] that's one way to speed up that deprecation process :p [18:48:14] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - git-ssh4_22: Servers phab2001-vcs.codfw.wmnet are marked down but pooled: git-ssh6_22: Servers phab2001-vcs.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [18:48:30] ACKNOWLEDGEMENT - SSH on phab2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds daniel_zahn just went down https://wikitech.wikimedia.org/wiki/SSH/monitoring [18:48:30] ACKNOWLEDGEMENT - Phabricator SMTP on phab2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds daniel_zahn just went down https://wikitech.wikimedia.org/wiki/Mail%23Troubleshooting [18:48:30] ACKNOWLEDGEMENT - Host phab2001 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn just went down [18:48:36] (PS4) JMeybohm: Ensure the data in kubernetes secrets is ordered by key [deployment-charts] - https://gerrit.wikimedia.org/r/774528 [18:49:41] oh good [18:50:03] yeah, I can't even reach the machine... [18:50:16] !log dzahn@cumin2002 conftool action : set/pooled=no; selector: dc=eqiad,name=phab2001.codfw.wmnet [18:50:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:50:30] thanks mutante [18:50:43] thcipriani: yw, i'm just glad it's not 1001 :) [18:51:09] that would be a fun way to end monday [18:51:27] so..
that Pybal alert should recover and the "vcs" service [18:51:33] that is what we want to shut down anyways [18:52:03] and phab2002 already exists [18:52:04] RECOVERY - Host phab2001 is UP: PING OK - Packet loss = 0%, RTA = 33.21 ms [18:52:08] oh, lol [18:52:10] ^ [18:52:11] i gueess it got bored of waiting [18:52:21] and wanted to speed up it going [18:52:56] ok, so .. it's not the host [18:53:01] up 129 days [18:53:06] that was networking then [18:53:07] hrm, "uptime" reports that the box never .... yeah ^ [18:53:14] my guess.. cable or switch port [18:53:16] RECOVERY - PyBal backends health check on lvs2008 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [18:53:17] during other work [18:53:29] maybe ask pa.paul [18:53:35] !log dzahn@cumin2002 conftool action : set/pooled=yes; selector: dc=eqiad,name=phab2001.codfw.wmnet [18:53:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:54:48] RhinosF1: I just did [18:55:12] :) [18:56:24] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is OK: (C)0.3 gt (W)0.1 gt 0.09677 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [18:56:38] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [18:59:14] (CR) Herron: [C: +1] sre: add ProbeDown paging alert for enabled services [alerts] - https://gerrit.wikimedia.org/r/773747 (https://phabricator.wikimedia.org/T291946) (owner: Filippo Giunchedi) [19:03:20] he is working there but in a different rack [19:03:22] hmmm [19:05:48] mutante: i see an amber light flashing on the server will check what going on with the server [19:06:25] papaul: thank you!
[19:07:44] !log dzahn@cumin2002 conftool action : set/pooled=no; selector: dc=eqiad,name=phab2001.codfw.wmnet [19:07:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:54] !log dzahn@cumin2002 conftool action : set/pooled=no; selector: dc=eqiad,name=phab2001-vcs.codfw.wmnet [19:09:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:07] ^ wrong name earlier, phab2001-vcs != phab2001 [19:09:32] mutante: Correctable memory error rate exceeded for DIMM_A1. [19:09:43] and the server is very old 6 years [19:09:52] !log dzahn@cumin2002 conftool action : set/pooled=no; selector: dc=codfw,name=phab2001-vcs.codfw.wmnet [19:09:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:10] papaul: yea, it's known that it's old and no warranty. ACK [19:10:21] (KubernetesRsyslogDown) firing: (2) rsyslog on ml-serve1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [19:11:06] papaul: the "fun" part is that it's running as it didn't happen though.. I guess there is not much for you to do [19:11:22] papaul: we already have replacement hardware .. so ... do nothing I guess [19:11:42] SRE, Data-Engineering, Data-Engineering-Kanban: Create conda .deb and docker image - https://phabricator.wikimedia.org/T304450 (Ottomata) Alright, I seem to have got reprepro to pull the update: https://apt.wikimedia.org/wikimedia/pool/thirdparty/conda/c/conda/ And, now conda is listed in both buste...
[19:16:31] SRE, Wikimedia-Mailing-lists: lists1001 - Icinga CRIT alerts - https://phabricator.wikimedia.org/T304886 (Dzahn) [19:17:02] PROBLEM - PyBal IPVS diff check on lvs2008 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([208.80.153.250:22, 2620:0:860:ed1a::3:fa:22]) https://wikitech.wikimedia.org/wiki/PyBal [19:17:11] SRE, Wikimedia-Mailing-lists: lists1001 - Icinga CRIT alerts - https://phabricator.wikimedia.org/T304886 (Dzahn) [19:17:46] ACKNOWLEDGEMENT - mailman list info on lists1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string Wikimedia Mailing List not found on https://lists.wikimedia.org:443/postorius/lists/wikimedia-l.lists.wikimedia.org/ - 8571 bytes in 0.261 second response time daniel_zahn https://phabricator.wikimedia.org/T304886 https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:17:51] ACKNOWLEDGEMENT - mailman archives on lists1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string Wikimedia Mailing List not found on https://lists.wikimedia.org:443/hyperkitty/list/wikimedia-l@lists.wikimedia.org/ - 47822 bytes in 0.073 second response time daniel_zahn https://phabricator.wikimedia.org/T304886 https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:18:52] SRE, Wikimedia-Mailing-lists: lists1001 - Icinga CRIT alerts - https://phabricator.wikimedia.org/T304886 (RhinosF1) Fallout of {T304323}?
[19:19:22] PROBLEM - PyBal IPVS diff check on lvs2010 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([208.80.153.250:22, 2620:0:860:ed1a::3:fa:22]) https://wikitech.wikimedia.org/wiki/PyBal [19:20:09] !log dzahn@cumin2002 conftool action : set/pooled=yes; selector: dc=codfw,name=phab2001-vcs.codfw.wmnet [19:20:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:22] ops-eqiad, DC-Ops, cloud-services-team (Hardware): (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (RobH) [19:20:43] ops-eqiad, DC-Ops, cloud-services-team (Hardware): (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (RobH) [19:22:20] SRE, ops-codfw, DC-Ops, cloud-services-team (Hardware): Q3:(Need By: TBD) rack/setup/install 7 wmcs hosts - https://phabricator.wikimedia.org/T304881 (Majavah) [19:23:14] RECOVERY - PyBal IPVS diff check on lvs2008 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [19:23:56] I repooled it again because otherwise those would alert the entire time ^ [19:24:08] we only have one (1) server for this service [19:25:08] SRE, Generated Data Platform, Service-deployment-requests: New Service Request Image Suggestions Feedback Service - https://phabricator.wikimedia.org/T304891 (WDoranWMF) [19:25:34] RECOVERY - PyBal IPVS diff check on lvs2010 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [19:28:38] (PS1) Razzi: Add superset-next domain CNAME [dns] - https://gerrit.wikimedia.org/r/774537 (https://phabricator.wikimedia.org/T275575) [19:34:06] (PS1) Razzi: kafka: allow access to jumbo from karapace1001 [puppet] - https://gerrit.wikimedia.org/r/774538 (https://phabricator.wikimedia.org/T301562) [19:34:57] (CR) Razzi: [V: +1] "PCC SUCCESS (DIFF 1):
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34597/console" [puppet] - https://gerrit.wikimedia.org/r/774538 (https://phabricator.wikimedia.org/T301562) (owner: Razzi) [19:39:20] SRE, Wikimedia-Mailing-lists: lists1001 - Icinga CRIT alerts - https://phabricator.wikimedia.org/T304886 (Dzahn) @RhinosF1 Yea, that sounds likely since the check commands are: ` check_command => "check_https_url_for_string!${lists_servername}!/hyperkitty/list/wikimedia-l@lists.wikimedia.org/!\... [19:40:40] (CR) Herron: "Hi, it looks like ms and thanos hosts are having some puppet errors, for instance:" [puppet] - https://gerrit.wikimedia.org/r/769941 (https://phabricator.wikimedia.org/T265117) (owner: MVernon) [19:42:12] SRE, observability, Patch-For-Review, SRE Observability (FY2021/2022-Q3), and 2 others: Unquoted URL parameter - https://phabricator.wikimedia.org/T304323 (Dzahn) looks like icinga mailman checks are affected by this T304886 [19:45:08] (PS1) Dzahn: icinga/lists: fix double quoted mailman monitoring check commands [puppet] - https://gerrit.wikimedia.org/r/774540 (https://phabricator.wikimedia.org/T304323) [19:46:06] (CR) jerkins-bot: [V: -1] icinga/lists: fix double quoted mailman monitoring check commands [puppet] - https://gerrit.wikimedia.org/r/774540 (https://phabricator.wikimedia.org/T304323) (owner: Dzahn) [19:47:19] (CR) Volans: "There is already Iaa646d572eca7f59f87dc34813e90deaa9ec0ee6 with the same fix."
[puppet] - https://gerrit.wikimedia.org/r/774540 (https://phabricator.wikimedia.org/T304323) (owner: Dzahn) [19:47:21] (PS2) Dzahn: icinga/lists: fix double quoted mailman monitoring check commands [puppet] - https://gerrit.wikimedia.org/r/774540 (https://phabricator.wikimedia.org/T304323) [19:47:39] (CR) Volans: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/774408 (https://phabricator.wikimedia.org/T304323) (owner: Filippo Giunchedi) [19:48:05] (Abandoned) Dzahn: icinga/lists: fix double quoted mailman monitoring check commands [puppet] - https://gerrit.wikimedia.org/r/774540 (https://phabricator.wikimedia.org/T304323) (owner: Dzahn) [19:48:20] (PS2) Dzahn: lists: remove double quoting for http check [puppet] - https://gerrit.wikimedia.org/r/774408 (https://phabricator.wikimedia.org/T304323) (owner: Filippo Giunchedi) [19:49:42] (CR) Dzahn: [C: +1] "but dependent on relation chain" [puppet] - https://gerrit.wikimedia.org/r/774407 (https://phabricator.wikimedia.org/T304323) (owner: Filippo Giunchedi) [19:50:39] (CR) Cwhite: [C: -1] logging: bump alerts logs retention (2 comments) [puppet] - https://gerrit.wikimedia.org/r/774364 (owner: Filippo Giunchedi) [19:51:17] SRE-Access-Requests: Requesting access to jclark-ctr - https://phabricator.wikimedia.org/T304896 (Jclark-ctr) [19:51:36] SRE, Infrastructure-Foundations: Many Ganeti hosts have disk space warnings on /boot - https://phabricator.wikimedia.org/T304897 (herron) [19:52:52] (PS1) RobH: update john's ssh key [puppet] - https://gerrit.wikimedia.org/r/774542 (https://phabricator.wikimedia.org/T3048961) [19:53:20] SRE, Generated Data Platform, Service-deployment-requests: New Service Request Generated Data Image Suggestions Service - https://phabricator.wikimedia.org/T304891 (WDoranWMF) [19:54:18] (PS2) RobH: update john's ssh key [puppet] - https://gerrit.wikimedia.org/r/774542
(https://phabricator.wikimedia.org/T304896) [19:54:27] SRE, Generated Data Platform, Service-deployment-requests: New Service Request Generated Datasets: Image Suggestions Service - https://phabricator.wikimedia.org/T304891 (WDoranWMF) [19:54:51] (CR) RobH: [C: +2] update john's ssh key [puppet] - https://gerrit.wikimedia.org/r/774542 (https://phabricator.wikimedia.org/T304896) (owner: RobH) [19:55:30] (PS1) Dzahn: gitlab: remove conditional, always use gitlab::restore class [puppet] - https://gerrit.wikimedia.org/r/774543 (https://phabricator.wikimedia.org/T274463) [19:56:01] (CR) Dzahn: "@Jelto https://gerrit.wikimedia.org/r/c/operations/puppet/+/774543/" [puppet] - https://gerrit.wikimedia.org/r/773783 (https://phabricator.wikimedia.org/T274463) (owner: Jelto) [19:57:33] SRE-Access-Requests, Patch-For-Review: Requesting access to jclark-ctr - https://phabricator.wikimedia.org/T304896 (RobH) Open→Resolved a:RobH I confirmed John's identity via google hangout, after chatting with him about this on irc and his filing of this request. change is merged live. as... [20:00:04] RoanKattouw and Urbanecm: #bothumor My software never has bugs. It just develops random features. Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220328T2000). [20:00:05] MatmaRex, zabe, and kostajh: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:11] hi [20:00:18] hi [20:00:23] hey [20:00:31] (CR) Dzahn: [C: +1] gitlab: add version check to restore script [puppet] - https://gerrit.wikimedia.org/r/773783 (https://phabricator.wikimedia.org/T274463) (owner: Jelto) [20:01:01] Hello. I can deploy in ~5 minutes if no one else's around.
[20:02:34] (PS1) Andrew Bogott: profile::wmcs::nfs::standalone: stop nfs server if cinder volume is detached [puppet] - https://gerrit.wikimedia.org/r/774544 (https://phabricator.wikimedia.org/T304706) [20:06:01] (CR) Andrew Bogott: [C: +2] profile::wmcs::nfs::standalone: stop nfs server if cinder volume is detached [puppet] - https://gerrit.wikimedia.org/r/774544 (https://phabricator.wikimedia.org/T304706) (owner: Andrew Bogott) [20:06:57] (CR) Dzahn: gitlab: move systemd interval for backup and restore to hiera (1 comment) [puppet] - https://gerrit.wikimedia.org/r/774416 (https://phabricator.wikimedia.org/T274463) (owner: Jelto) [20:07:32] (CR) Dzahn: "looks good to me, you could use "Systemd::Timer::Schedule" instead of String though as data type for these" [puppet] - https://gerrit.wikimedia.org/r/774416 (https://phabricator.wikimedia.org/T274463) (owner: Jelto) [20:07:34] I'm here now [20:08:14] hi MatmaRex, i see that the backport is abandoned in master. Is that intentional? [20:08:46] yes [20:09:01] we fixed it differently in master, but this is easier to backport [20:09:01] zabe: your change has a merge conflict. Can you rebase manually please? [20:10:29] MatmaRex: okay, thanks for the clarification. It looks safe enough, so I'm going to merge it [20:10:33] (CR) Urbanecm: [C: +2] Disable backtick sequence in ve-mw while conflict with Catalan is investigated [extensions/VisualEditor] (wmf/1.39.0-wmf.4) - https://gerrit.wikimedia.org/r/774400 (https://phabricator.wikimedia.org/T304804) (owner: Bartosz Dziewoński) [20:11:08] SRE, Infrastructure-Foundations: puppetmaster1001 disk warning on / - https://phabricator.wikimedia.org/T304898 (herron) p:Triage→High [20:11:11] kostajh: just a quick clarification, does your config change depend on the NS_MEDIAWIKI you asked me to introduce via Slack? [20:11:39] or can i go ahead with it now, and do the msg once we know what the exact text needs to be?
[20:12:37] (PS3) Zabe: Stop writing to $wmfAllServices [mediawiki-config] - https://gerrit.wikimedia.org/r/773609 (https://phabricator.wikimedia.org/T45956) [20:13:11] urbanecm: you can go ahead with it now [20:13:13] urbanecm, done [20:13:23] kostajh: okay, will do. [20:13:28] (CR) Cwhite: "Cannot add datasource "JSON API: Cannot connect to API"" [puppet] - https://gerrit.wikimedia.org/r/774380 (https://phabricator.wikimedia.org/T304583) (owner: Phedenskog) [20:13:30] (CR) Urbanecm: [C: +2] GrowthExperiments: Add more expanded topics for GLAM campaign [mediawiki-config] - https://gerrit.wikimedia.org/r/774519 (https://phabricator.wikimedia.org/T301029) (owner: Kosta Harlan) [20:14:50] !log pruned /var/log/apache2/puppetmaster.puppet.log.[123]* on puppetmaster1001 T304898 [20:14:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:56] T304898: puppetmaster1001 disk warning on / - https://phabricator.wikimedia.org/T304898 [20:15:21] (Merged) jenkins-bot: GrowthExperiments: Add more expanded topics for GLAM campaign [mediawiki-config] - https://gerrit.wikimedia.org/r/774519 (https://phabricator.wikimedia.org/T301029) (owner: Kosta Harlan) [20:16:12] kostajh: pulled to mwdebug1001, can you have a look please?
[20:16:21] urbanecm: looking [20:16:51] (CR) Dzahn: [C: +1] "yep, valid in "systemd-analyze calendar "" [puppet] - https://gerrit.wikimedia.org/r/774418 (https://phabricator.wikimedia.org/T274463) (owner: Jelto) [20:17:04] urbanecm: looks good to me [20:17:13] kostajh: thx, syncing [20:18:00] (PS4) Urbanecm: Stop writing to $wmfAllServices [mediawiki-config] - https://gerrit.wikimedia.org/r/773609 (https://phabricator.wikimedia.org/T45956) (owner: Zabe) [20:18:04] (CR) Urbanecm: [C: +2] Stop writing to $wmfAllServices [mediawiki-config] - https://gerrit.wikimedia.org/r/773609 (https://phabricator.wikimedia.org/T45956) (owner: Zabe) [20:18:15] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:18:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:18:29] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: e8a5b3b662db6780c0ed9a33e07e54e84295d1dd: GrowthExperiments: Add more expanded topics for GLAM campaign (T301029) (duration: 00m 50s) [20:18:32] kostajh: your change should be live now :) [20:18:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:18:35] T301029: Account creation: GLAM event topic availability - https://phabricator.wikimedia.org/T301029 [20:18:54] thanks!
[20:19:08] (Merged) jenkins-bot: Stop writing to $wmfAllServices [mediawiki-config] - https://gerrit.wikimedia.org/r/773609 (https://phabricator.wikimedia.org/T45956) (owner: Zabe) [20:19:10] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:19:11] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:19:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:55] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:19:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:20:15] zabe: your patch is at mwdebug1001. Can you check? [20:21:24] urbanecm, lgtm [20:21:27] syncing [20:22:42] !log urbanecm@deploy1002 Synchronized wmf-config/CommonSettings.php: dfa963895f39760b647be5507c7f74ec3489cd22: Stop writing to $wmfAllServices (T45956) (duration: 00m 55s) [20:22:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:22:48] T45956: Rename $wmf* to $wmg* in wmf-config - https://phabricator.wikimedia.org/T45956 [20:22:50] zabe: should be live. [20:22:52] anything else?
[20:23:14] no, thanks [20:23:28] no problem [20:23:44] (Abandoned) Dzahn: puppetmaster::geoip: stop using class for legacy maxmind downloads in prod [puppet] - https://gerrit.wikimedia.org/r/773649 (https://phabricator.wikimedia.org/T303464) (owner: Dzahn) [20:25:00] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:25:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:25:58] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:26:00] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:26:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:26:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:26:21] SRE, MediaWiki-Stakeholders-Group, Traffic, Wikipedia-Android-App-Backlog, Performance-Team (Radar): RFC: API-driven web front-end - https://phabricator.wikimedia.org/T111588 (cscott) A lot more work was done under {T114542} and eventually several projects were started up. #marvin was the off... [20:26:24] * urbanecm is waiting for CI on MatmaRex's patch [20:26:49] (CR) BryanDavis: [C: +1] "Untested, but the changes look correct visually."
[puppet] - https://gerrit.wikimedia.org/r/774401 (https://phabricator.wikimedia.org/T304869) (owner: Reedy) [20:28:36] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:28:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:30:31] (PS2) Dzahn: puppetmaster:geoip: stop trying to download GeoIP1 legacy databases [puppet] - https://gerrit.wikimedia.org/r/773843 (https://phabricator.wikimedia.org/T303464) [20:31:00] (Merged) jenkins-bot: Disable backtick sequence in ve-mw while conflict with Catalan is investigated [extensions/VisualEditor] (wmf/1.39.0-wmf.4) - https://gerrit.wikimedia.org/r/774400 (https://phabricator.wikimedia.org/T304804) (owner: Bartosz Dziewoński) [20:31:43] (PS1) Kosta Harlan: GLAM events: add topic match mode widget selector [extensions/GrowthExperiments] (wmf/1.39.0-wmf.4) - https://gerrit.wikimedia.org/r/774405 (https://phabricator.wikimedia.org/T301825) [20:32:27] MatmaRex: pulled to mwdebug1001, can you check please? [20:33:01] urbanecm: yeah. looks good [20:33:09] syncing [20:34:40] !log urbanecm@deploy1002 Synchronized php-1.39.0-wmf.4/extensions/VisualEditor/modules/ve-mw/ui/ve.ui.MWSequenceRegistry.js: f32ae21f2456b69d615c0d63fc12cff097ba3e31: Disable backtick sequence in ve-mw while conflict with Catalan is investigated (T304804) (duration: 00m 57s) [20:34:44] MatmaRex: and, live [20:34:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:34:46] anything else?
[20:34:47] T304804: Unable to type grave accents (backtick) in visual editor - https://phabricator.wikimedia.org/T304804 [20:36:18] thanks [20:36:20] np [20:38:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:38:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:39:39] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:39:40] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:39:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:39:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:40:40] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:40:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:02] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations, netops: Q2:(Need By: TBD) replace mr1-eqiad - https://phabricator.wikimedia.org/T294474 (Jclark-ctr) [20:42:13] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations, netops: Q2:(Need By: TBD) replace mr1-eqiad - https://phabricator.wikimedia.org/T294474 (Jclark-ctr) Racked and cabled updated netbox with connections [20:42:38] (CR) Dzahn: [C: -1] "I made a a test by removing one of the databases and running the update and confirmed as of today we are still succesfully downloading new" [puppet] - https://gerrit.wikimedia.org/r/773843 (https://phabricator.wikimedia.org/T303464) (owner: Dzahn) [20:51:17] (PS2) SBassett: Set StopForumSpam to enforce on the beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/773340 (https://phabricator.wikimedia.org/T304111) [20:51:25] (PS2) Dzahn: geoip::maxmind: remove code for absenting old resources [puppet] - https://gerrit.wikimedia.org/r/773844 (https://phabricator.wikimedia.org/T303464) [20:51:57] (CR) Dzahn: [C: +2] "confirmed
none of these files exist on puppetmaster1001, just code cleanup" [puppet] - https://gerrit.wikimedia.org/r/773844 (https://phabricator.wikimedia.org/T303464) (owner: Dzahn) [20:55:31] jouncebot: nowandnext [20:55:31] For the next 0 hour(s) and 4 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220328T2000) [20:55:31] In 0 hour(s) and 4 minute(s): Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220328T2100) [20:57:49] (CR) SBassett: [C: +2] Set StopForumSpam to enforce on the beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/773340 (https://phabricator.wikimedia.org/T304111) (owner: SBassett) [20:58:54] (Merged) jenkins-bot: Set StopForumSpam to enforce on the beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/773340 (https://phabricator.wikimedia.org/T304111) (owner: SBassett) [21:00:04] SRE, Thumbor, Traffic, affects-Kiwix-and-openZIM: MWoffliner scrapes slowed down by Thumbor failure throttling 429s - https://phabricator.wikimedia.org/T304814 (AntiCompositeNumber) > But, even if we agree with that, what is sure is that it can not be that a random final user, after one request,... [21:00:05] Reedy and sbassett: (Dis)respected human, time to deploy Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220328T2100). Please do the needful.
[21:00:25] (PS2) Dzahn: geoip::maxmind: rename the update timers, don't use 'legacy' term [puppet] - https://gerrit.wikimedia.org/r/773845 (https://phabricator.wikimedia.org/T303464) [21:00:36] (PS3) Dzahn: geoip::maxmind: rename the update timers, don't use 'legacy' term [puppet] - https://gerrit.wikimedia.org/r/773845 (https://phabricator.wikimedia.org/T303464) [21:01:40] (CR) jerkins-bot: [V: -1] geoip::maxmind: rename the update timers, don't use 'legacy' term [puppet] - https://gerrit.wikimedia.org/r/773845 (https://phabricator.wikimedia.org/T303464) (owner: Dzahn) [21:03:46] !log sbassett@deploy1002 Synchronized wmf-config/CommonSettings-labs.php: Deploy CS-labs.php config to set StopForumSpam to enforce on beta (duration: 01m 03s) [21:03:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:05:17] (PS4) Dzahn: geoip::maxmind: rename the update timers, don't use 'legacy' term [puppet] - https://gerrit.wikimedia.org/r/773845 (https://phabricator.wikimedia.org/T303464) [21:05:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:05:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:05:54] sbassett: on the offchance you won't need the whole window, let me know when you're finished?
no rush :) [21:06:14] (CR) jerkins-bot: [V: -1] geoip::maxmind: rename the update timers, don't use 'legacy' term [puppet] - https://gerrit.wikimedia.org/r/773845 (https://phabricator.wikimedia.org/T303464) (owner: Dzahn) [21:06:45] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:06:46] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:06:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:06:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:09:05] (PS5) Dzahn: geoip::maxmind: rename the update timers, don't use 'legacy' term [puppet] - https://gerrit.wikimedia.org/r/773845 (https://phabricator.wikimedia.org/T303464) [21:09:31] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:09:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:09:52] (CR) Dzahn: "https://puppet-compiler.wmflabs.org/pcc-worker1001/34600/" [puppet] - https://gerrit.wikimedia.org/r/773845 (https://phabricator.wikimedia.org/T303464) (owner: Dzahn) [21:09:59] (CR) jerkins-bot: [V: -1] geoip::maxmind: rename the update timers, don't use 'legacy' term [puppet] - https://gerrit.wikimedia.org/r/773845 (https://phabricator.wikimedia.org/T303464) (owner: Dzahn) [21:13:07] (PS6) Dzahn: geoip::maxmind: rename the update timers, don't use 'legacy' term [puppet] - https://gerrit.wikimedia.org/r/773845 (https://phabricator.wikimedia.org/T303464) [21:13:57] (CR) jerkins-bot: [V: -1] geoip::maxmind: rename the update timers, don't use 'legacy' term [puppet] - https://gerrit.wikimedia.org/r/773845 (https://phabricator.wikimedia.org/T303464) (owner: Dzahn) [21:15:35] (PS7) Dzahn: geoip::maxmind: rename the update timers, don't use 'legacy' term [puppet] - https://gerrit.wikimedia.org/r/773845 (https://phabricator.wikimedia.org/T303464)
[21:19:07] SRE, MediaWiki-Stakeholders-Group, Traffic, Wikipedia-Android-App-Backlog, Performance-Team (Radar): RFC: API-driven web front-end - https://phabricator.wikimedia.org/T111588 (Krinkle) A more narrow proposal, driven by specific performance and user experience outcomes, exists at {T140664} as... [21:19:37] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:19:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:20:29] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:20:31] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:20:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:20:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:21:29] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:21:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:26:32] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:26:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:27:20] (CR) Dzahn: [C: +1] "https://puppet-compiler.wmflabs.org/pcc-worker1001/34602/" [puppet] - https://gerrit.wikimedia.org/r/773845 (https://phabricator.wikimedia.org/T303464) (owner: Dzahn) [21:27:36] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:27:37] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:27:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:27:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:28:24] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:28:27] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:32:03] !log Undeployed sec patch for T285159, which caused a high volume of errors on the canaries [21:32:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:33:27] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:33:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:34:06] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:34:07] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:34:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:34:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:34:53] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:34:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:39:56] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [21:55:30] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [22:10:45] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [22:14:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1164 (T300775)', diff saved to 
https://phabricator.wikimedia.org/P23429 and previous config saved to /var/cache/conftool/dbconfig/20220328-221448-marostegui.json [22:14:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:14:55] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775 [22:20:45] (JobUnavailable) firing: Reduced availability for job trafficserver in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:28:49] (PS8) RLazarus: mediawiki: Redirect Special:CodeReview to static archives [puppet] - https://gerrit.wikimedia.org/r/724049 (https://phabricator.wikimedia.org/T205361) (owner: Majavah) [22:29:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1164', diff saved to https://phabricator.wikimedia.org/P23430 and previous config saved to /var/cache/conftool/dbconfig/20220328-222953-marostegui.json [22:29:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:31:27] !log rzl@cumin2002:~$ sudo cumin A:mw 'disable-puppet T205361' [22:31:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:31:32] T205361: Make an HTML dump of the output of the CodeReview extension on MediaWiki.org - https://phabricator.wikimedia.org/T205361 [22:31:39] (CR) RLazarus: [C: +2] mediawiki: Redirect Special:CodeReview to static archives [puppet] - https://gerrit.wikimedia.org/r/724049 (https://phabricator.wikimedia.org/T205361) (owner: Majavah) [22:39:14] !log rzl@cumin2002:~$ sudo cumin A:mw 'enable-puppet T205361' [22:39:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:39:20] T205361: Make an HTML dump of the output of the CodeReview extension on MediaWiki.org - https://phabricator.wikimedia.org/T205361 [22:40:10] PROBLEM - Check systemd state on 
cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:40:38] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [22:44:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1164', diff saved to https://phabricator.wikimedia.org/P23431 and previous config saved to /var/cache/conftool/dbconfig/20220328-224459-marostegui.json [22:45:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:47:39] (CR) Andrew Bogott: [C: +2] Keystone: Update deprecated action=oathvalidate calls [puppet] - https://gerrit.wikimedia.org/r/774401 (https://phabricator.wikimedia.org/T304869) (owner: Reedy) [23:00:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1164 (T300775)', diff saved to https://phabricator.wikimedia.org/P23433 and previous config saved to /var/cache/conftool/dbconfig/20220328-230004-marostegui.json [23:00:05] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1184.eqiad.wmnet with reason: Maintenance [23:00:07] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1184.eqiad.wmnet with reason: Maintenance [23:00:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:00:09] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775 [23:00:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1184 (T300775)', diff saved to https://phabricator.wikimedia.org/P23434 and previous config saved to /var/cache/conftool/dbconfig/20220328-230012-marostegui.json [23:00:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[23:00:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:00:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:06:04] PROBLEM - SSH on mw2258.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:10:45] (KubernetesRsyslogDown) firing: (2) rsyslog on ml-serve1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [23:22:38] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [23:26:46] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [23:32:33] SRE, DC-Ops: Confirm support of PERC 750 raid controller - https://phabricator.wikimedia.org/T297913 (RobH) a:RobH→Papaul Ok, I dug in some more and I've gotten some success, but not enough. I'm wondering if @papaul may have some time to review this as well. I've attached the copy of the Dell g... [23:36:56] PROBLEM - LVS zotero eqiad port 4969/tcp - Zotero- zotero.svc.eqiad.wmnet IPv4 #page on zotero.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [23:37:11] mrm [23:38:14] giving that a moment to see if it self-resolves like it did yesterday, else I'll see if a rolling restart is needed [23:38:54] RECOVERY - LVS zotero eqiad port 4969/tcp - Zotero- zotero.svc.eqiad.wmnet IPv4 #page on zotero.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 196 bytes in 1.014 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems