[00:04:44] ACKNOWLEDGEMENT - Ensure legal html en.wb on en.wikibooks.org is CRITICAL: Text\sis\savailable\sunder\sthe a\shref=\/\/creativecommons\.org\/licenses\/by-sa\/3\.0\/Creative\sCommons\sAttribution-ShareAlike\sLicense./a: additional\sterms\smay\sapply\. html not found daniel_zahn https://phabricator.wikimedia.org/T317169#8216329 https://phabricator.wikimedia.org/project/members/28/
[00:05:26] heh
[00:32:36] (PS3) Eevans: cassandra: Create new role for testing AQS bulk-loader changes [puppet] - https://gerrit.wikimedia.org/r/830268 (https://phabricator.wikimedia.org/T317140)
[00:32:40] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[00:33:22] (CR) CI reject: [V: -1] cassandra: Create new role for testing AQS bulk-loader changes [puppet] - https://gerrit.wikimedia.org/r/830268 (https://phabricator.wikimedia.org/T317140) (owner: Eevans)
[00:36:01] RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:39:49] PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[00:47:01] RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[00:59:05] PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[01:03:55] PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[01:08:39] RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[01:19:43] (PS1) Jdlrobson: Wikidata has a wordmark [mediawiki-config] - https://gerrit.wikimedia.org/r/830312 (https://phabricator.wikimedia.org/T315572)
[01:20:41] PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[01:20:56] (CR) CI reject: [V: -1] Wikidata has a wordmark [mediawiki-config] - https://gerrit.wikimedia.org/r/830312 (https://phabricator.wikimedia.org/T315572) (owner: Jdlrobson)
[01:24:59] (PS2) Jdlrobson: Wikidata has a wordmark [mediawiki-config] - https://gerrit.wikimedia.org/r/830312 (https://phabricator.wikimedia.org/T315572)
[01:25:01] (PS1) Jdlrobson: Enable Extension:Nearby on wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/830313 (https://phabricator.wikimedia.org/T246493)
[01:30:17] RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[01:31:51] PROBLEM - SSH on mw1314.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:36:45] (JobUnavailable) firing: Reduced availability for job workhorse in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:41:45] (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:46:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:47:07] PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[01:49:33] RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[01:51:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2172 (T314041)', diff saved to https://phabricator.wikimedia.org/P33982 and previous config saved to /var/cache/conftool/dbconfig/20220907-015116-ladsgroup.json
[01:51:19] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2140.codfw.wmnet with reason: Maintenance
[01:51:20] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[01:51:32] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2140.codfw.wmnet with reason: Maintenance
[01:51:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2140 (T314041)', diff saved to https://phabricator.wikimedia.org/P33983 and previous config saved to /var/cache/conftool/dbconfig/20220907-015138-ladsgroup.json
[01:51:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:06:45] (JobUnavailable) firing: (8) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:11:45] (JobUnavailable) resolved: (5) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:12:35] RECOVERY - Check systemd state on ms-be1071 is OK: OK - running: The system is fully operational
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:27:13] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1140.eqiad.wmnet with reason: Maintenance
[02:27:38] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1140.eqiad.wmnet with reason: Maintenance
[02:33:09] RECOVERY - SSH on mw1314.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:42:31] PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[02:49:47] RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[03:18:25] PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[03:26:57] (CR) Ori: Sort query parameters in URLs (1 comment) [software/purged] - https://gerrit.wikimedia.org/r/830124 (https://phabricator.wikimedia.org/T317064) (owner: Giuseppe Lavagetto)
[03:30:03] RECOVERY - Check systemd state on stat1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:37:17] PROBLEM - Check systemd state on stat1007 is CRITICAL: CRITICAL - degraded: The following units failed: search-drop-query-clicks.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:49:31] PROBLEM - Swift https backend on ms-fe1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.292 second response time
https://wikitech.wikimedia.org/wiki/Swift
[03:51:49] RECOVERY - Swift https backend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 451 bytes in 0.021 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:52:13] RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[04:20:59] PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[04:25:49] RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[04:37:25] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[04:43:19] PROBLEM - Swift https frontend on ms-fe1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.138 second response time https://wikitech.wikimedia.org/wiki/Swift
[04:45:35] RECOVERY - Swift https frontend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.010 second response time https://wikitech.wikimedia.org/wiki/Swift
[05:21:05] PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[05:25:57] RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[05:27:41] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: eventlogging_to_druid_editattemptstep_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:29:26] (PS1) Marostegui: Revert "db1111: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/830289
[05:30:19] (CR) Marostegui: [C: +2] Revert "db1111: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/830289 (owner: Marostegui)
[05:30:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1111 (re)pooling @ 1%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P33985 and previous config saved to /var/cache/conftool/dbconfig/20220907-053053-root.json
[05:31:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1174 T316342', diff saved to https://phabricator.wikimedia.org/P33986 and previous config saved to /var/cache/conftool/dbconfig/20220907-053154-root.json
[05:31:57] T316342: Productionize db1196-db1203 - https://phabricator.wikimedia.org/T316342
[05:33:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1172 T316342', diff saved to https://phabricator.wikimedia.org/P33988 and previous config saved to /var/cache/conftool/dbconfig/20220907-053350-root.json
[05:36:24] (PS1) Giuseppe Lavagetto: docker_registry_ha: clean up after nginx's socket [puppet] - https://gerrit.wikimedia.org/r/830483
[05:37:21] (CR) Giuseppe Lavagetto: [C: +2] docker_registry_ha: clean up after nginx's socket [puppet] - https://gerrit.wikimedia.org/r/830483 (owner: Giuseppe Lavagetto)
[05:38:32] (PS1) Marostegui: db1196: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/830484 (https://phabricator.wikimedia.org/T316342)
[05:39:11] (CR) Marostegui: [C: +2] db1196: Enable notifications [puppet] -
https://gerrit.wikimedia.org/r/830484 (https://phabricator.wikimedia.org/T316342) (owner: Marostegui)
[05:43:13] (PS1) Marostegui: instances.yaml: Add db1196 to dbctl [puppet] - https://gerrit.wikimedia.org/r/830485 (https://phabricator.wikimedia.org/T316342)
[05:45:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1111 (re)pooling @ 2%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P33990 and previous config saved to /var/cache/conftool/dbconfig/20220907-054557-root.json
[05:47:18] (CR) Marostegui: [C: +2] instances.yaml: Add db1196 to dbctl [puppet] - https://gerrit.wikimedia.org/r/830485 (https://phabricator.wikimedia.org/T316342) (owner: Marostegui)
[05:48:23] (PS1) Giuseppe Lavagetto: docker_registry_ha: more fixes to the nginx configuration [puppet] - https://gerrit.wikimedia.org/r/830486
[05:49:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1196 to s1, depooled, T316342', diff saved to https://phabricator.wikimedia.org/P33991 and previous config saved to /var/cache/conftool/dbconfig/20220907-054910-marostegui.json
[05:49:13] T316342: Productionize db1196-db1203 - https://phabricator.wikimedia.org/T316342
[05:51:28] (CR) Giuseppe Lavagetto: [C: +2] docker_registry_ha: more fixes to the nginx configuration [puppet] - https://gerrit.wikimedia.org/r/830486 (owner: Giuseppe Lavagetto)
[05:52:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pooling db1196 for the first time in s1 T316342', diff saved to https://phabricator.wikimedia.org/P33992 and previous config saved to /var/cache/conftool/dbconfig/20220907-055201-marostegui.json
[05:54:11] (PS1) Marostegui: db1197: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/830487 (https://phabricator.wikimedia.org/T316342)
[05:55:57] (CR) Marostegui: [C: +2] db1197: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/830487 (https://phabricator.wikimedia.org/T316342) (owner: Marostegui)
[05:56:12] _joe_: ok to merge your changes?
[05:56:25] <_joe_> yeah sorry
[05:56:31] no problem!
[05:56:33] merging
[05:56:34] <_joe_> I thought I did not submit it
[06:00:33] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:01:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1111 (re)pooling @ 3%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P33993 and previous config saved to /var/cache/conftool/dbconfig/20220907-060102-root.json
[06:04:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1196 (re)pooling @ 2%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P33994 and previous config saved to /var/cache/conftool/dbconfig/20220907-060401-root.json
[06:05:51] (PS1) Marostegui: instances.yaml: Add db1197 to dbctl [puppet] - https://gerrit.wikimedia.org/r/830489 (https://phabricator.wikimedia.org/T316342)
[06:06:34] (CR) Marostegui: [C: +2] instances.yaml: Add db1197 to dbctl [puppet] - https://gerrit.wikimedia.org/r/830489 (https://phabricator.wikimedia.org/T316342) (owner: Marostegui)
[06:08:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1197 to s2, depooled, T316342', diff saved to https://phabricator.wikimedia.org/P33995 and previous config saved to /var/cache/conftool/dbconfig/20220907-060828-marostegui.json
[06:08:32] T316342: Productionize db1196-db1203 - https://phabricator.wikimedia.org/T316342
[06:11:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pooling db1197 for the first time in s2 T316342', diff saved to https://phabricator.wikimedia.org/P33996 and previous config saved to /var/cache/conftool/dbconfig/20220907-061147-marostegui.json
[06:16:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1111 (re)pooling @ 4%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P33997 and previous config saved to
/var/cache/conftool/dbconfig/20220907-061607-root.json
[06:17:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1197 (re)pooling @ 2%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P33998 and previous config saved to /var/cache/conftool/dbconfig/20220907-061747-root.json
[06:19:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1196 (re)pooling @ 3%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P33999 and previous config saved to /var/cache/conftool/dbconfig/20220907-061906-root.json
[06:19:07] (PS1) Marostegui: db1198: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/830490
[06:19:53] (CR) Marostegui: [C: +2] db1198: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/830490 (owner: Marostegui)
[06:31:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1111 (re)pooling @ 5%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34000 and previous config saved to /var/cache/conftool/dbconfig/20220907-063112-root.json
[06:32:25] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:32:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1197 (re)pooling @ 3%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P34001 and previous config saved to /var/cache/conftool/dbconfig/20220907-063252-root.json
[06:33:29] (CR) Ayounsi: [V: +1] "https://puppet-compiler.wmflabs.org/pcc-worker1003/37142/" [puppet] - https://gerrit.wikimedia.org/r/677872 (https://phabricator.wikimedia.org/T265864) (owner: Ayounsi)
[06:34:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1196 (re)pooling @ 4%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P34002 and previous config saved to /var/cache/conftool/dbconfig/20220907-063410-root.json
[06:36:51] SRE,
Observability-Logging, Patch-For-Review, SRE Observability (FY2022/2023-Q1): Move Kafka logging to the new intermediate PKI - https://phabricator.wikimedia.org/T300130 (elukey) @colewhite do you know if there are remaining clients still not using the new TLS bundle? IIRC there were a couple...
[06:39:45] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 234, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:39:51] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 80, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:42:01] (PS2) Muehlenhoff: swift: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/811233 (https://phabricator.wikimedia.org/T308013)
[06:42:07] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 133, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:42:26] (PS1) Marostegui: instances.yaml: Add db1198 to dbctl [puppet] - https://gerrit.wikimedia.org/r/830491 (https://phabricator.wikimedia.org/T316342)
[06:43:11] (CR) Marostegui: [C: +2] instances.yaml: Add db1198 to dbctl [puppet] - https://gerrit.wikimedia.org/r/830491 (https://phabricator.wikimedia.org/T316342) (owner: Marostegui)
[06:49:58] (PS1) Marostegui: mariadb: Productionize db1202 [puppet] - https://gerrit.wikimedia.org/r/830492 (https://phabricator.wikimedia.org/T316342)
[06:50:46] (CR) Marostegui: [C: +2] mariadb: Productionize db1202 [puppet] - https://gerrit.wikimedia.org/r/830492 (https://phabricator.wikimedia.org/T316342) (owner: Marostegui)
[06:52:05] (PS1) Marostegui: install_server: Do not reimage db1197 [puppet] -
https://gerrit.wikimedia.org/r/830493
[06:52:52] (CR) Marostegui: [C: +2] install_server: Do not reimage db1197 [puppet] - https://gerrit.wikimedia.org/r/830493 (owner: Marostegui)
[06:56:49] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 134, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:00:04] Amir1 and Urbanecm: How many deployers does it take to do UTC morning backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220907T0700).
[07:00:05] _joe_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:34] o/
[07:01:21] * urbanecm assumes _joe_ will self-serve
[07:01:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1111 (re)pooling @ 25%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34008 and previous config saved to /var/cache/conftool/dbconfig/20220907-070122-root.json
[07:03:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1197 (re)pooling @ 5%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P34009 and previous config saved to /var/cache/conftool/dbconfig/20220907-070301-root.json
[07:04:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1196 (re)pooling @ 10%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P34010 and previous config saved to /var/cache/conftool/dbconfig/20220907-070420-root.json
[07:04:36] (PS1) Marostegui: mariadb: Productionize db1203 [puppet] - https://gerrit.wikimedia.org/r/830494 (https://phabricator.wikimedia.org/T316342)
[07:07:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1198 (re)pooling @ 2%: Pooling for the first time in s3', diff saved to https://phabricator.wikimedia.org/P34011 and previous config
saved to /var/cache/conftool/dbconfig/20220907-070750-root.json
[07:08:22] (CR) Marostegui: [C: +2] mariadb: Productionize db1203 [puppet] - https://gerrit.wikimedia.org/r/830494 (https://phabricator.wikimedia.org/T316342) (owner: Marostegui)
[07:09:27] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 236, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:09:37] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 81, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:16:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1111 (re)pooling @ 50%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34012 and previous config saved to /var/cache/conftool/dbconfig/20220907-071627-root.json
[07:17:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2146 and db2122', diff saved to https://phabricator.wikimedia.org/P34013 and previous config saved to /var/cache/conftool/dbconfig/20220907-071744-root.json
[07:18:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1197 (re)pooling @ 10%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P34014 and previous config saved to /var/cache/conftool/dbconfig/20220907-071806-root.json
[07:19:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1196 (re)pooling @ 25%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P34015 and previous config saved to /var/cache/conftool/dbconfig/20220907-071925-root.json
[07:21:01] SRE-OnFire, DBA, Sustainability (Incident Followup): Investigate mariadb 10.6 performance regression during spikes/high load - https://phabricator.wikimedia.org/T311106 (Marostegui) As we are doing multi-dc now, I have installed the new package with the fix also on db2122 (s7) and
db2146 (s1) which a...
[07:21:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1172 (re)pooling @ 1%: Repooling after cloning another host', diff saved to https://phabricator.wikimedia.org/P34016 and previous config saved to /var/cache/conftool/dbconfig/20220907-072151-root.json
[07:22:13] SRE, LDAP-Access-Requests, WMF-NDA-Requests, Discovery-Search (Current work): Add pfischer to #wmf-nda on Phab and to #wmf on LDAP - https://phabricator.wikimedia.org/T316922 (pfischer) a:pfischer→bking @bking, thanks! I works, at least I'm able to log into `thanos` and `grafana-rw`, now....
[07:22:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2146 (re)pooling @ 5%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34017 and previous config saved to /var/cache/conftool/dbconfig/20220907-072214-root.json
[07:22:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1198 (re)pooling @ 3%: Pooling for the first time in s3', diff saved to https://phabricator.wikimedia.org/P34018 and previous config saved to /var/cache/conftool/dbconfig/20220907-072255-root.json
[07:26:27] (PS1) Marostegui: db1199: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/830497 (https://phabricator.wikimedia.org/T316342)
[07:27:07] (CR) Marostegui: [C: +2] db1199: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/830497 (https://phabricator.wikimedia.org/T316342) (owner: Marostegui)
[07:28:19] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2095.codfw.wmnet with reason: Maintenance
[07:28:22] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2095.codfw.wmnet with reason: Maintenance
[07:31:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1111 (re)pooling @ 75%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34019 and previous config saved to
/var/cache/conftool/dbconfig/20220907-073131-root.json
[07:32:07] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2140.codfw.wmnet with reason: Maintenance
[07:32:10] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2140.codfw.wmnet with reason: Maintenance
[07:33:09] (CR) Ori: [C: +1] Sort query parameters in URLs [software/purged] - https://gerrit.wikimedia.org/r/830124 (https://phabricator.wikimedia.org/T317064) (owner: Giuseppe Lavagetto)
[07:33:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1197 (re)pooling @ 25%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P34020 and previous config saved to /var/cache/conftool/dbconfig/20220907-073311-root.json
[07:33:25] !log jmm@cumin1001 START - Cookbook sre.hosts.reboot-single for host cumin2002.codfw.wmnet
[07:34:23] (PS1) Marostegui: instances.yaml: Add db1199 to dbctl [puppet] - https://gerrit.wikimedia.org/r/830498 (https://phabricator.wikimedia.org/T316342)
[07:34:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1196 (re)pooling @ 50%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P34021 and previous config saved to /var/cache/conftool/dbconfig/20220907-073430-root.json
[07:35:08] (CR) Marostegui: [C: +2] instances.yaml: Add db1199 to dbctl [puppet] - https://gerrit.wikimedia.org/r/830498 (https://phabricator.wikimedia.org/T316342) (owner: Marostegui)
[07:35:31] (PS1) Cathal Mooney: Depool eqsin for core router upgrades.
[dns] - https://gerrit.wikimedia.org/r/830499 (https://phabricator.wikimedia.org/T295690)
[07:36:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1172 (re)pooling @ 2%: Repooling after cloning another host', diff saved to https://phabricator.wikimedia.org/P34022 and previous config saved to /var/cache/conftool/dbconfig/20220907-073655-root.json
[07:37:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1199 to s4, depooled, T316342', diff saved to https://phabricator.wikimedia.org/P34023 and previous config saved to /var/cache/conftool/dbconfig/20220907-073727-marostegui.json
[07:37:31] T316342: Productionize db1196-db1203 - https://phabricator.wikimedia.org/T316342
[07:37:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2146 (re)pooling @ 10%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34024 and previous config saved to /var/cache/conftool/dbconfig/20220907-073732-root.json
[07:37:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pooling db1199 for the first time in s4 T316342', diff saved to https://phabricator.wikimedia.org/P34025 and previous config saved to /var/cache/conftool/dbconfig/20220907-073745-marostegui.json
[07:38:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1198 (re)pooling @ 4%: Pooling for the first time in s3', diff saved to https://phabricator.wikimedia.org/P34026 and previous config saved to /var/cache/conftool/dbconfig/20220907-073800-root.json
[07:38:31] SRE, SRE-Access-Requests, Data Engineering Planning, Discovery-Search (Current work), Patch-For-Review: Production Shell access for Peter - https://phabricator.wikimedia.org/T316090 (pfischer) @Gehel, looks good to me, I'm at least able to SSH into bastion 3005 (esams).
[07:40:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1174 (re)pooling @ 5%: Repooling after cloning another host', diff saved to https://phabricator.wikimedia.org/P34027 and previous config saved to /var/cache/conftool/dbconfig/20220907-074039-root.json
[07:41:28] (CR) Ayounsi: [C: +1] Depool eqsin for core router upgrades. [dns] - https://gerrit.wikimedia.org/r/830499 (https://phabricator.wikimedia.org/T295690) (owner: Cathal Mooney)
[07:42:24] (PS1) Marostegui: install_server: Do not reimage db1198 [puppet] - https://gerrit.wikimedia.org/r/830500
[07:42:37] (CR) Cathal Mooney: [C: +2] Depool eqsin for core router upgrades. [dns] - https://gerrit.wikimedia.org/r/830499 (https://phabricator.wikimedia.org/T295690) (owner: Cathal Mooney)
[07:43:14] (CR) Marostegui: [C: +2] install_server: Do not reimage db1198 [puppet] - https://gerrit.wikimedia.org/r/830500 (owner: Marostegui)
[07:44:32] !log jmm@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host cumin2002.codfw.wmnet
[07:46:16] !log Depool eqsin from user traffic in advance of core router upgrades - T295690
[07:46:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:46:19] T295690: Upgrade core routers to Junos 21+ - https://phabricator.wikimedia.org/T295690
[07:46:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1111 (re)pooling @ 100%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34028 and previous config saved to /var/cache/conftool/dbconfig/20220907-074636-root.json
[07:47:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1199 (re)pooling @ 2%: Pooling for the first time in s4', diff saved to https://phabricator.wikimedia.org/P34029 and previous config saved to /var/cache/conftool/dbconfig/20220907-074746-root.json
[07:48:15] (CR) Muehlenhoff: [C: +2] swift: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/811233
(https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [07:48:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1197 (re)pooling @ 50%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P34030 and previous config saved to /var/cache/conftool/dbconfig/20220907-074816-root.json [07:49:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1196 (re)pooling @ 75%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P34031 and previous config saved to /var/cache/conftool/dbconfig/20220907-074935-root.json [07:49:58] 10SRE, 10Infrastructure-Foundations, 10netops: Upgrade network devices to Junos 20+ - https://phabricator.wikimedia.org/T316539 (10ayounsi) [07:50:04] 10SRE, 10netops: Junos changes for management-instance support on QFX - https://phabricator.wikimedia.org/T269340 (10ayounsi) [07:51:08] (03PS1) 10Marostegui: db2122: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/830502 [07:51:51] (03CR) 10Marostegui: [C: 03+2] db2122: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/830502 (owner: 10Marostegui) [07:52:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1172 (re)pooling @ 3%: Repooling after cloning another host', diff saved to https://phabricator.wikimedia.org/P34032 and previous config saved to /var/cache/conftool/dbconfig/20220907-075200-root.json [07:52:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2146 (re)pooling @ 25%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34033 and previous config saved to /var/cache/conftool/dbconfig/20220907-075237-root.json [07:53:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1198 (re)pooling @ 5%: Pooling for the first time in s3', diff saved to https://phabricator.wikimedia.org/P34034 and previous config saved to /var/cache/conftool/dbconfig/20220907-075305-root.json [07:55:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1174 (re)pooling @ 10%: Repooling 
after cloning another host', diff saved to https://phabricator.wikimedia.org/P34035 and previous config saved to /var/cache/conftool/dbconfig/20220907-075544-root.json [07:56:43] (03PS1) 10Marostegui: intances.yaml: Add db1200 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/830555 (https://phabricator.wikimedia.org/T316342) [07:57:12] <_joe_> urbanecm: yeah sorry, I was sure I scheduled the deploy for tomorrow [07:57:22] (03PS2) 10Muehlenhoff: network: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/811230 (https://phabricator.wikimedia.org/T308013) [07:57:25] no worries :) [07:57:27] <_joe_> I had an appointment at 9 am this morning so no way I could respect the timing [07:57:48] (03CR) 10Marostegui: [C: 03+2] intances.yaml: Add db1200 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/830555 (https://phabricator.wikimedia.org/T316342) (owner: 10Marostegui) [07:58:15] PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [07:59:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1200 to s5, depooled, T316342', diff saved to https://phabricator.wikimedia.org/P34036 and previous config saved to /var/cache/conftool/dbconfig/20220907-075919-marostegui.json [07:59:23] T316342: Productionize db1196-db1203 - https://phabricator.wikimedia.org/T316342 [08:01:34] (03CR) 10Vgutierrez: [C: 03+1] Remove 185.15.56.0/24 from network::external (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/677872 (https://phabricator.wikimedia.org/T265864) (owner: 10Ayounsi) [08:02:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1199 (re)pooling @ 3%: Pooling for the first time in s4', diff saved to https://phabricator.wikimedia.org/P34037 and previous config saved to /var/cache/conftool/dbconfig/20220907-080251-root.json [08:03:21] !log marostegui@cumin1001 dbctl commit 
(dc=all): 'db1197 (re)pooling @ 75%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P34038 and previous config saved to /var/cache/conftool/dbconfig/20220907-080321-root.json [08:03:39] PROBLEM - Check systemd state on ms-be2033 is CRITICAL: CRITICAL - degraded: The following units failed: swift-drive-audit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:04:23] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1169.eqiad.wmnet with reason: Maintenance [08:04:37] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1169.eqiad.wmnet with reason: Maintenance [08:04:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1196 (re)pooling @ 100%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P34039 and previous config saved to /var/cache/conftool/dbconfig/20220907-080439-root.json [08:04:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1169 (T312863)', diff saved to https://phabricator.wikimedia.org/P34040 and previous config saved to /var/cache/conftool/dbconfig/20220907-080449-ladsgroup.json [08:04:53] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863 [08:06:38] (03PS1) 10Marostegui: db1200: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/830560 [08:07:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1172 (re)pooling @ 4%: Repooling after cloning another host', diff saved to https://phabricator.wikimedia.org/P34041 and previous config saved to /var/cache/conftool/dbconfig/20220907-080705-root.json [08:07:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2146 (re)pooling @ 50%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34042 and previous config saved to /var/cache/conftool/dbconfig/20220907-080742-root.json [08:07:54] (03CR) 10Marostegui: [C: 03+2] db1200: 
Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/830560 (owner: 10Marostegui) [08:08:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1198 (re)pooling @ 10%: Pooling for the first time in s3', diff saved to https://phabricator.wikimedia.org/P34043 and previous config saved to /var/cache/conftool/dbconfig/20220907-080810-root.json [08:08:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pooling db1200 for the first time in s5 T316342', diff saved to https://phabricator.wikimedia.org/P34044 and previous config saved to /var/cache/conftool/dbconfig/20220907-080825-marostegui.json [08:08:30] T316342: Productionize db1196-db1203 - https://phabricator.wikimedia.org/T316342 [08:10:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1174 (re)pooling @ 25%: Repooling after cloning another host', diff saved to https://phabricator.wikimedia.org/P34045 and previous config saved to /var/cache/conftool/dbconfig/20220907-081049-root.json [08:13:25] PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:13:28] (03CR) 10Klausman: [C: 03+1] api-gateway: don't conditionally rewrite if asked, always do it [deployment-charts] - 10https://gerrit.wikimedia.org/r/830196 (owner: 10Hnowlan) [08:16:09] PROBLEM - ElasticSearch setting check - 9400 on elastic2047 is CRITICAL: CRITICAL - [elastic2027.codfw.wmnet:9700, elastic2029.codfw.wmnet:9700, elastic2054.codfw.wmnet:9700] does not match [elastic2054.codfw.wmnet:9700, elastic2076.codfw.wmnet:9700, elastic2080.codfw.wmnet:9700] for .(cluster https://wikitech.wikimedia.org/wiki/Search%23Administration [08:16:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1200 (re)pooling @ 2%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P34046 and previous config saved to 
/var/cache/conftool/dbconfig/20220907-081655-root.json [08:17:19] PROBLEM - ElasticSearch setting check - 9200 on elastic2042 is CRITICAL: CRITICAL - [elastic2027.codfw.wmnet:9700, elastic2029.codfw.wmnet:9700, elastic2054.codfw.wmnet:9700] does not match [elastic2054.codfw.wmnet:9700, elastic2076.codfw.wmnet:9700, elastic2080.codfw.wmnet:9700] for .(cluster https://wikitech.wikimedia.org/wiki/Search%23Administration [08:17:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1199 (re)pooling @ 4%: Pooling for the first time in s4', diff saved to https://phabricator.wikimedia.org/P34047 and previous config saved to /var/cache/conftool/dbconfig/20220907-081756-root.json [08:18:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1197 (re)pooling @ 100%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P34048 and previous config saved to /var/cache/conftool/dbconfig/20220907-081826-root.json [08:22:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1172 (re)pooling @ 5%: Repooling after cloning another host', diff saved to https://phabricator.wikimedia.org/P34049 and previous config saved to /var/cache/conftool/dbconfig/20220907-082210-root.json [08:22:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2146 (re)pooling @ 75%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34050 and previous config saved to /var/cache/conftool/dbconfig/20220907-082247-root.json [08:23:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1198 (re)pooling @ 25%: Pooling for the first time in s3', diff saved to https://phabricator.wikimedia.org/P34051 and previous config saved to /var/cache/conftool/dbconfig/20220907-082315-root.json [08:24:22] (03PS1) 10Cathal Mooney: Disbale VRRP auth in eqsin [homer/public] - 10https://gerrit.wikimedia.org/r/830563 (https://phabricator.wikimedia.org/T295690) [08:25:35] (03PS1) 10Marostegui: db1201: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/830564 
[08:25:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1174 (re)pooling @ 50%: Repooling after cloning another host', diff saved to https://phabricator.wikimedia.org/P34052 and previous config saved to /var/cache/conftool/dbconfig/20220907-082554-root.json [08:26:22] (03CR) 10Marostegui: [C: 03+2] db1201: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/830564 (owner: 10Marostegui) [08:26:33] PROBLEM - Swift https backend on ms-fe1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.292 second response time https://wikitech.wikimedia.org/wiki/Swift [08:26:54] (03CR) 10Ayounsi: [C: 03+1] Disbale VRRP auth in eqsin [homer/public] - 10https://gerrit.wikimedia.org/r/830563 (https://phabricator.wikimedia.org/T295690) (owner: 10Cathal Mooney) [08:27:29] (03CR) 10Cathal Mooney: [C: 03+2] Disbale VRRP auth in eqsin [homer/public] - 10https://gerrit.wikimedia.org/r/830563 (https://phabricator.wikimedia.org/T295690) (owner: 10Cathal Mooney) [08:27:35] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:28:30] (03Merged) 10jenkins-bot: Disbale VRRP auth in eqsin [homer/public] - 10https://gerrit.wikimedia.org/r/830563 (https://phabricator.wikimedia.org/T295690) (owner: 10Cathal Mooney) [08:28:41] RECOVERY - Swift https backend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 451 bytes in 0.045 second response time https://wikitech.wikimedia.org/wiki/Swift [08:31:23] (03CR) 10Vgutierrez: [C: 04-1] Sort query parameters in URLs (031 comment) [software/purged] - 10https://gerrit.wikimedia.org/r/830124 (https://phabricator.wikimedia.org/T317064) (owner: 10Giuseppe Lavagetto) [08:31:53] RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [08:32:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1200 (re)pooling @ 3%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P34053 and previous config saved to /var/cache/conftool/dbconfig/20220907-083200-root.json [08:33:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1199 (re)pooling @ 5%: Pooling for the first time in s4', diff saved to https://phabricator.wikimedia.org/P34054 and previous config saved to /var/cache/conftool/dbconfig/20220907-083300-root.json [08:33:46] (03PS5) 10Giuseppe Lavagetto: Move 50% of traffic to php 7.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823680 (https://phabricator.wikimedia.org/T271736) [08:35:01] (03PS1) 10Marostegui: instances.yaml: Add db1201 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/830565 (https://phabricator.wikimedia.org/T316342) [08:35:25] !log cmooney@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cr2-eqsin.wikimedia.org with reason: router upgrade [08:35:26] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cr2-eqsin.wikimedia.org with reason: router upgrade [08:35:41] !log cmooney@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cr2-eqsin with reason: router upgrade [08:35:55] !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cr2-eqsin with reason: router upgrade [08:36:00] 10SRE, 10Infrastructure-Foundations, 10netops: Upgrade core routers to Junos 21+ - https://phabricator.wikimedia.org/T295690 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=7af287ca-21ab-4f9d-adb3-478641fdd465) set by cmooney@cumin1001 for 2:00:00 on 1 host(s) and their services with reas... 
[08:36:06] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Move 50% of traffic to php 7.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823680 (https://phabricator.wikimedia.org/T271736) (owner: 10Giuseppe Lavagetto) [08:37:10] !log cmooney@cumin1001 START - Cookbook sre.network.cf [08:37:10] !log cmooney@cumin1001 END (PASS) - Cookbook sre.network.cf (exit_code=0) [08:37:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1172 (re)pooling @ 10%: Repooling after cloning another host', diff saved to https://phabricator.wikimedia.org/P34055 and previous config saved to /var/cache/conftool/dbconfig/20220907-083715-root.json [08:37:40] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [08:37:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2146 (re)pooling @ 100%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34056 and previous config saved to /var/cache/conftool/dbconfig/20220907-083752-root.json [08:37:59] PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [08:38:15] (03CR) 10Marostegui: [C: 03+2] instances.yaml: Add db1201 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/830565 (https://phabricator.wikimedia.org/T316342) (owner: 10Marostegui) [08:38:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1198 (re)pooling @ 50%: Pooling for the first time in s3', diff saved to https://phabricator.wikimedia.org/P34057 and previous config saved to /var/cache/conftool/dbconfig/20220907-083820-root.json [08:39:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1131 (s6 master) from API', diff saved to 
https://phabricator.wikimedia.org/P34058 and previous config saved to /var/cache/conftool/dbconfig/20220907-083958-root.json [08:40:47] RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [08:40:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1201 to s6, depooled, T316342', diff saved to https://phabricator.wikimedia.org/P34059 and previous config saved to /var/cache/conftool/dbconfig/20220907-084057-marostegui.json [08:41:00] T316342: Productionize db1196-db1203 - https://phabricator.wikimedia.org/T316342 [08:41:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1174 (re)pooling @ 75%: Repooling after cloning another host', diff saved to https://phabricator.wikimedia.org/P34060 and previous config saved to /var/cache/conftool/dbconfig/20220907-084105-root.json [08:41:07] PROBLEM - ElasticSearch setting check - 9400 on elastic2052 is CRITICAL: CRITICAL - [elastic2027.codfw.wmnet:9700, elastic2029.codfw.wmnet:9700, elastic2054.codfw.wmnet:9700] does not match [elastic2054.codfw.wmnet:9700, elastic2076.codfw.wmnet:9700, elastic2080.codfw.wmnet:9700] for .(cluster https://wikitech.wikimedia.org/wiki/Search%23Administration [08:41:21] 10SRE-tools, 10Infrastructure-Foundations, 10Release-Engineering-Team: Investigate sharing releng common python code to pywmflib - https://phabricator.wikimedia.org/T316757 (10hashar) Releng might be able to cut a tag a new release which we will then be able to use immediately by bumping the dependency in ou... [08:41:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove weight from x1 master', diff saved to https://phabricator.wikimedia.org/P34061 and previous config saved to /var/cache/conftool/dbconfig/20220907-084133-marostegui.json [08:41:49] RECOVERY - ElasticSearch setting check - 9600 on elastic2076 is OK: OK - All good! 
https://wikitech.wikimedia.org/wiki/Search%23Administration [08:42:20] !log oblivian@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:823680|Move 50% of traffic to php 7.4 (T271736)]] (duration: 04m 00s) [08:42:21] PROBLEM - ElasticSearch setting check - 9400 on elastic2042 is CRITICAL: CRITICAL - [elastic2027.codfw.wmnet:9700, elastic2029.codfw.wmnet:9700, elastic2054.codfw.wmnet:9700] does not match [elastic2054.codfw.wmnet:9700, elastic2076.codfw.wmnet:9700, elastic2080.codfw.wmnet:9700] for .(cluster https://wikitech.wikimedia.org/wiki/Search%23Administration [08:42:24] T271736: Migrate WMF production from PHP 7.2 to PHP 7.4 - https://phabricator.wikimedia.org/T271736 [08:42:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1122 (s2 master) from API', diff saved to https://phabricator.wikimedia.org/P34062 and previous config saved to /var/cache/conftool/dbconfig/20220907-084232-root.json [08:42:47] 10SRE, 10Traffic, 10Platform Team Initiatives (API Gateway), 10Story: Client Developer has a cookie-free API call - https://phabricator.wikimedia.org/T258748 (10Vgutierrez) [08:42:51] 10SRE, 10Traffic: Don't set cookies for api.wikimedia.org at the caching layer - https://phabricator.wikimedia.org/T260943 (10Vgutierrez) 05Open→03Resolved ` $ curl -v -o /dev/null -s https://api.wikimedia.org/feed/v1/wikipedia/en/onthisday/all/09/07 2>&1 | egrep -i "geoip|wmf-last-access"; echo $? 1 ` cl... 
[08:44:13] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [08:44:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pooling db1201 for the first time in s6 T316342', diff saved to https://phabricator.wikimedia.org/P34063 and previous config saved to /var/cache/conftool/dbconfig/20220907-084454-marostegui.json [08:44:55] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:45:14] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [08:45:15] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [08:46:06] (03PS5) 10Samtar: private/readme.php: Add $wgPhonosApiKeyGoogle [mediawiki-config] - 10https://gerrit.wikimedia.org/r/825762 (https://phabricator.wikimedia.org/T315491) [08:46:11] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [08:47:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1200 (re)pooling @ 4%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P34064 and previous config saved to /var/cache/conftool/dbconfig/20220907-084705-root.json [08:47:12] (03CR) 10Jbond: Spicerack: add configuration file and API key for PeeringDB (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/819562 (owner: 10Ayounsi) [08:48:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1199 (re)pooling @ 10%: Pooling for the first time in s4', diff saved to https://phabricator.wikimedia.org/P34065 and previous config saved to /var/cache/conftool/dbconfig/20220907-084805-root.json [08:49:47] (03CR) 10Muehlenhoff: [C: 03+2] network: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/811230 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [08:51:10] !log rebooting cr2-eqsin to complete JunOS upgrade [08:51:11] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1172 (re)pooling @ 25%: Repooling after cloning another host', diff saved to https://phabricator.wikimedia.org/P34066 and previous config saved to /var/cache/conftool/dbconfig/20220907-085220-root.json [08:53:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1198 (re)pooling @ 75%: Pooling for the first time in s3', diff saved to https://phabricator.wikimedia.org/P34067 and previous config saved to /var/cache/conftool/dbconfig/20220907-085325-root.json [08:55:19] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:55:39] PROBLEM - Router interfaces on cr3-eqsin is CRITICAL: CRITICAL: host 103.102.166.131, interfaces up: 67, down: 3, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:56:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1174 (re)pooling @ 100%: Repooling after cloning another host', diff saved to https://phabricator.wikimedia.org/P34068 and previous config saved to /var/cache/conftool/dbconfig/20220907-085610-root.json [08:56:41] PROBLEM - OSPF status on mr1-eqsin is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:56:46] o/ I know I asked this before, but I just wanted to check again — ref T315491, I've scheduled https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/825762 which adds a `null` value to `private/readme.php`. During that deployment window do I also edit the real PrivateSettings? Is there a PrivateSettings just for beta? 
(cc urbanecm because you know everything) [08:56:46] T315491: Add $wgPhonosApiKeyGoogle to PrivateSettings - https://phabricator.wikimedia.org/T315491 [08:57:33] PROBLEM - Host cr2-eqsin IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [08:58:36] (03CR) 10Jbond: "just resiled i fogot to hit send for the comment below" [puppet] - 10https://gerrit.wikimedia.org/r/829321 (owner: 10Andrew Bogott) [08:58:49] 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): Move cloud vps ns-recursor IPs to host/row-independent addressing - https://phabricator.wikimedia.org/T307357 (10ayounsi) [09:00:12] TheresNoTime: iirc yes, there's a similar 'private' repo for deployment-prep that lives on the deployment-prep deploy hosts [09:00:25] RECOVERY - OSPF status on mr1-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:00:33] taavi: thank you :) [09:01:14] TheresNoTime: as taavi says :). readme.php should represent the actual PS.php reasonably-well. [09:01:21] 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): Move cloud vps ns-recursor IPs to host/row-independent addressing - https://phabricator.wikimedia.org/T307357 (10taavi) [09:01:42] dare I ask if this is documented somewhere?
:) [09:01:54] and the usual caveat of 'private' stuff on deployment-prep applies here as well [09:02:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1200 (re)pooling @ 5%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P34069 and previous config saved to /var/cache/conftool/dbconfig/20220907-090210-root.json [09:02:39] RECOVERY - Host cr2-eqsin IPv6 is UP: PING OK - Packet loss = 0%, RTA = 286.75 ms [09:03:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1199 (re)pooling @ 25%: Pooling for the first time in s4', diff saved to https://phabricator.wikimedia.org/P34070 and previous config saved to /var/cache/conftool/dbconfig/20220907-090310-root.json [09:04:33] TheresNoTime: want to write some documentation? :P [09:04:48] taavi: for the first time ever, I asked that with the idea that maybe I should :P [09:05:17] * TheresNoTime will document how this goes [09:07:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1172 (re)pooling @ 50%: Repooling after cloning another host', diff saved to https://phabricator.wikimedia.org/P34071 and previous config saved to /var/cache/conftool/dbconfig/20220907-090725-root.json [09:08:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1198 (re)pooling @ 100%: Pooling for the first time in s3', diff saved to https://phabricator.wikimedia.org/P34072 and previous config saved to /var/cache/conftool/dbconfig/20220907-090830-root.json [09:09:39] RECOVERY - Router interfaces on cr3-eqsin is OK: OK: host 103.102.166.131, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:09:50] hm okay so I see `deployment-deploy03:/srv/mediawiki-staging/private$` exists, and there's an uncommitted change to `PrivateSettings.php` — can I "just" make a change there then? 
[09:10:20] !log cgoubert@puppetmaster1001 conftool action : set/pooled=no:weight=10; selector: dc=eqiad,cluster=parsoid,name=parse1013.eqiad.wmnet [09:13:38] (03PS1) 10Btullis: Remove duplicate YAML hash from releases hieradata [puppet] - 10https://gerrit.wikimedia.org/r/830569 [09:14:10] it should be a git repo :/ [09:14:15] 10SRE, 10Infrastructure-Foundations, 10netops, 10Tracking-Neverending: [tracking] Don't keep on the public vlans hosts that don't require it - https://phabricator.wikimedia.org/T317177 (10Peachey88) [09:14:42] 10SRE, 10Phabricator, 10Traffic, 10Wikimedia-Incident: Phabricator was logging out users repeatedly (2022-08-26) - https://phabricator.wikimedia.org/T316337 (10hashar) I apologize for my unclear comment, I was referring to the notes taking document at https://docs.google.com/document/d/1Ka9MQB8OwdzAzJVfZua... [09:15:49] it is (: [09:15:56] (03CR) 10Btullis: "I just happened upon this YAML duplication. Assigning reviewers based on git blame." [puppet] - 10https://gerrit.wikimedia.org/r/830569 (owner: 10Btullis) [09:17:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1200 (re)pooling @ 10%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P34073 and previous config saved to /var/cache/conftool/dbconfig/20220907-091715-root.json [09:17:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1201 (re)pooling @ 2%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P34074 and previous config saved to /var/cache/conftool/dbconfig/20220907-091740-root.json [09:18:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1199 (re)pooling @ 50%: Pooling for the first time in s4', diff saved to https://phabricator.wikimedia.org/P34075 and previous config saved to /var/cache/conftool/dbconfig/20220907-091815-root.json [09:18:53] (03PS1) 10Marostegui: install_server: Do not reimage db1199 [puppet] - 10https://gerrit.wikimedia.org/r/830570 [09:18:55] PROBLEM - k8s API server requests 
latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [09:19:29] RECOVERY - BGP status on cr3-eqsin is OK: BGP OK - up: 22, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:19:49] (03CR) 10Marostegui: [C: 03+2] install_server: Do not reimage db1199 [puppet] - 10https://gerrit.wikimedia.org/r/830570 (owner: 10Marostegui) [09:19:52] !log cmooney@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on cr3-eqsin with reason: router upgrade [09:20:06] !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on cr3-eqsin with reason: router upgrade [09:20:12] 10SRE, 10Infrastructure-Foundations, 10netops: Upgrade core routers to Junos 21+ - https://phabricator.wikimedia.org/T295690 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=826e80d5-55a6-4bb6-ab1c-e094eba7f6cd) set by cmooney@cumin1001 for 1:00:00 on 1 host(s) and their services with reas... [09:20:39] !log rebooting cr3-eqsin to complete JunOS upgrade [09:20:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:20:47] RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [09:21:43] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:22:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1172 (re)pooling @ 75%: Repooling after cloning another host', diff saved to https://phabricator.wikimedia.org/P34076 and previous config saved to /var/cache/conftool/dbconfig/20220907-092230-root.json [09:23:41] PROBLEM - Check systemd state on otrs1001 is CRITICAL: CRITICAL - degraded: The following units failed: spamassassin_updates.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:24:44] 10SRE, 10Phabricator, 10Traffic, 10Wikimedia-Incident: Phabricator was logging out users repeatedly (2022-08-26) - https://phabricator.wikimedia.org/T316337 (10jcrespo) > Does it have to be converted to an incident report on Wikitech? It does. > I could do it but could use pairing with someone familiar w... 
[09:25:09] PROBLEM - Host cr3-eqsin.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [09:25:11] PROBLEM - Host cr3-eqsin IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [09:25:47] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:26:18] !log restart swift-proxy and repool ms-fe1012 [09:26:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:21] PROBLEM - OSPF status on mr1-eqsin is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:27:22] (03PS1) 10Jbond: P:idp: add addtional loglevel types [puppet] - 10https://gerrit.wikimedia.org/r/830574 [09:28:38] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37144/console" [puppet] - 10https://gerrit.wikimedia.org/r/830574 (owner: 10Jbond) [09:29:17] RECOVERY - Check systemd state on ms-be2033 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:30:19] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:31:38] !log pooled parse1013.eqiad.wmnet (php 7.4 only) in parsoid cluster T307219 [09:31:39] RECOVERY - Host cr3-eqsin.mgmt is UP: PING OK - Packet loss = 0%, RTA = 259.73 ms [09:31:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:41] RECOVERY - Host cr3-eqsin IPv6 is UP: PING OK - Packet loss = 0%, RTA = 259.88 ms [09:31:41] T307219: Put parse parse10[01-24] in production - https://phabricator.wikimedia.org/T307219 [09:32:07] RECOVERY - OSPF status on mr1-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:32:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1200 (re)pooling @ 
25%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P34077 and previous config saved to /var/cache/conftool/dbconfig/20220907-093219-root.json [09:32:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1201 (re)pooling @ 3%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P34078 and previous config saved to /var/cache/conftool/dbconfig/20220907-093247-root.json [09:33:09] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Package and deploy ATS 9.1.3 - https://phabricator.wikimedia.org/T309651 (10Vgutierrez) p:05Triage→03Medium [09:33:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1199 (re)pooling @ 75%: Pooling for the first time in s4', diff saved to https://phabricator.wikimedia.org/P34079 and previous config saved to /var/cache/conftool/dbconfig/20220907-093320-root.json [09:35:41] !log cgoubert@cumin1001 START - Cookbook sre.hosts.remove-downtime for parse1013.eqiad.wmnet [09:35:41] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for parse1013.eqiad.wmnet [09:36:17] (03PS2) 10JMeybohm: Alert on high lateny of kubelet operations [alerts] - 10https://gerrit.wikimedia.org/r/830228 (https://phabricator.wikimedia.org/T311251) [09:36:31] 10SRE, 10Infrastructure-Foundations, 10netops, 10Tracking-Neverending: [tracking] Don't keep on the public vlans hosts that don't require it - https://phabricator.wikimedia.org/T317177 (10ayounsi) [09:37:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1172 (re)pooling @ 100%: Repooling after cloning another host', diff saved to https://phabricator.wikimedia.org/P34080 and previous config saved to /var/cache/conftool/dbconfig/20220907-093736-root.json [09:41:07] (03PS1) 10Cathal Mooney: Revert "Depool eqsin for core router upgrades." 
[dns] - 10https://gerrit.wikimedia.org/r/830575 (https://phabricator.wikimedia.org/T295690) [09:44:29] !log depooled wtp1046.eqiad.wmnet from parsoid cluster T307219 [09:44:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:44:32] T307219: Put parse parse10[01-24] in production - https://phabricator.wikimedia.org/T307219 [09:44:59] (03CR) 10Ayounsi: [C: 03+1] Revert "Depool eqsin for core router upgrades." [dns] - 10https://gerrit.wikimedia.org/r/830575 (https://phabricator.wikimedia.org/T295690) (owner: 10Cathal Mooney) [09:46:29] (03CR) 10Cathal Mooney: [C: 03+2] Revert "Depool eqsin for core router upgrades." [dns] - 10https://gerrit.wikimedia.org/r/830575 (https://phabricator.wikimedia.org/T295690) (owner: 10Cathal Mooney) [09:47:09] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/830574 (owner: 10Jbond) [09:47:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1200 (re)pooling @ 50%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P34081 and previous config saved to /var/cache/conftool/dbconfig/20220907-094724-root.json [09:47:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1201 (re)pooling @ 4%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P34082 and previous config saved to /var/cache/conftool/dbconfig/20220907-094752-root.json [09:48:09] !log Re-pooling eqsin for user traffic after successful core router upgrades - T295690 [09:48:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:12] T295690: Upgrade core routers to Junos 21+ - https://phabricator.wikimedia.org/T295690 [09:48:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1199 (re)pooling @ 100%: Pooling for the first time in s4', diff saved to https://phabricator.wikimedia.org/P34083 and previous config saved to /var/cache/conftool/dbconfig/20220907-094825-root.json [09:50:07] 10SRE, 10observability: en.wikibooks.org 
has changed legal footer - https://phabricator.wikimedia.org/T317169 (10fgiunchedi) p:05Triage→03Medium [09:52:43] (03PS3) 10Clément Goubert: scap/cumin: switch parsoid eqiad canaries [puppet] - 10https://gerrit.wikimedia.org/r/830193 (https://phabricator.wikimedia.org/T307219) [09:53:23] !log cgoubert@puppetmaster1001 conftool action : set/pooled=no:weight=10; selector: dc=eqiad,cluster=parsoid,name=parse1014.eqiad.wmnet [09:54:54] (03CR) 10Giuseppe Lavagetto: [C: 03+1] scap/cumin: switch parsoid eqiad canaries [puppet] - 10https://gerrit.wikimedia.org/r/830193 (https://phabricator.wikimedia.org/T307219) (owner: 10Clément Goubert) [09:57:10] !log cgoubert@cumin1001 START - Cookbook sre.hosts.remove-downtime for parse1014.eqiad.wmnet [09:57:11] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for parse1014.eqiad.wmnet [09:57:44] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:idp: add addtional loglevel types [puppet] - 10https://gerrit.wikimedia.org/r/830574 (owner: 10Jbond) [10:02:02] 10SRE-OnFire, 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team, 10User-Ryasmeen, 10Wikimedia-Incident: Beta cluster Error: 502, Next Hop Connection Failed - https://phabricator.wikimedia.org/T315350 (10jbond) 05Open→03Resolved a:03jbond @Zabe thanks for the detailed summary, it looks like... 
[10:02:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1200 (re)pooling @ 75%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P34084 and previous config saved to /var/cache/conftool/dbconfig/20220907-100229-root.json [10:02:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1201 (re)pooling @ 5%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P34085 and previous config saved to /var/cache/conftool/dbconfig/20220907-100257-root.json [10:05:18] !log pooled parse1014.eqiad.wmnet (php 7.4 only) in parsoid cluster T307219 [10:05:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:05:23] T307219: Put parse parse10[01-24] in production - https://phabricator.wikimedia.org/T307219 [10:07:10] (03PS1) 10Jbond: raid: fix raid_mgmt_tools fact [puppet] - 10https://gerrit.wikimedia.org/r/830578 [10:08:17] (03PS2) 10Jbond: raid: fix raid_mgmt_tools fact [puppet] - 10https://gerrit.wikimedia.org/r/830578 (https://phabricator.wikimedia.org/T315608) [10:10:46] !log cgoubert@puppetmaster1001 conftool action : set/pooled=no:weight=10; selector: dc=eqiad,cluster=parsoid,name=parse1014.eqiad.wmnet [10:10:54] (03CR) 10Muehlenhoff: raid: fix raid_mgmt_tools fact (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/830578 (https://phabricator.wikimedia.org/T315608) (owner: 10Jbond) [10:11:04] (03PS1) 10Btullis: Add an entry for the cfssl-issuer service to the dse-k8s cluster [puppet] - 10https://gerrit.wikimedia.org/r/830580 (https://phabricator.wikimedia.org/T310175) [10:12:06] !log repooled parse1014.eqiad.wmnet (php 7.4 only) in parsoid cluster T307219 [10:12:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:09] T307219: Put parse parse10[01-24] in production - https://phabricator.wikimedia.org/T307219 [10:12:34] !log cgoubert@puppetmaster1001 conftool action : set/pooled=no:weight=10; selector: 
dc=eqiad,cluster=parsoid,name=parse1015.eqiad.wmnet [10:12:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2120 to clone db2122', diff saved to https://phabricator.wikimedia.org/P34086 and previous config saved to /var/cache/conftool/dbconfig/20220907-101258-root.json [10:13:32] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37145/console" [puppet] - 10https://gerrit.wikimedia.org/r/830580 (https://phabricator.wikimedia.org/T310175) (owner: 10Btullis) [10:14:12] jouncebot: now [10:14:12] No deployments scheduled for the next 2 hour(s) and 45 minute(s) [10:15:08] (03PS3) 10Jbond: raid: fix raid_mgmt_tools fact [puppet] - 10https://gerrit.wikimedia.org/r/830578 (https://phabricator.wikimedia.org/T315608) [10:15:28] (03CR) 10Jbond: raid: fix raid_mgmt_tools fact (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/830578 (https://phabricator.wikimedia.org/T315608) (owner: 10Jbond) [10:16:15] I’ll deploy the config change https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/829020 if nobody minds [10:17:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1200 (re)pooling @ 100%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P34088 and previous config saved to /var/cache/conftool/dbconfig/20220907-101734-root.json [10:17:55] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] Enable sitelinks to redirects on testwikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/829020 (https://phabricator.wikimedia.org/T316637) (owner: 10Michael Große) [10:17:59] (03PS3) 10Lucas Werkmeister (WMDE): Enable sitelinks to redirects on testwikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/829020 (https://phabricator.wikimedia.org/T316637) (owner: 10Michael Große) [10:18:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1201 (re)pooling @ 10%: Pooling for the first time', diff saved to 
https://phabricator.wikimedia.org/P34089 and previous config saved to /var/cache/conftool/dbconfig/20220907-101801-root.json [10:21:02] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Enable sitelinks to redirects on testwikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/829020 (https://phabricator.wikimedia.org/T316637) (owner: 10Michael Große) [10:21:45] !log depooled wtp1047.eqiad.wmnet from parsoid cluster T307219 [10:21:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:21:49] T307219: Put parse parse10[01-24] in production - https://phabricator.wikimedia.org/T307219 [10:22:10] (03Merged) 10jenkins-bot: Enable sitelinks to redirects on testwikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/829020 (https://phabricator.wikimedia.org/T316637) (owner: 10Michael Große) [10:22:37] testing on mwdebug1001 (might take a few minutes) [10:22:39] (03PS3) 10Cathal Mooney: Sub-delegation of reverse DNS entries for 185.15.57.16/29 to WMCS [dns] - 10https://gerrit.wikimedia.org/r/826803 (https://phabricator.wikimedia.org/T315955) [10:23:26] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/830578 (https://phabricator.wikimedia.org/T315608) (owner: 10Jbond) [10:25:03] !log cgoubert@cumin1001 START - Cookbook sre.hosts.remove-downtime for parse1015.eqiad.wmnet [10:25:03] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for parse1015.eqiad.wmnet [10:26:23] !log pooled parse1015.eqiad.wmnet (php 7.4 only) in parsoid cluster T307219 [10:26:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:26:41] my config change seems to work fine, I’ll sync it [10:27:23] (03CR) 10Hnowlan: [C: 03+2] api-gateway: don't conditionally rewrite if asked, always do it [deployment-charts] - 10https://gerrit.wikimedia.org/r/830196 (owner: 10Hnowlan) [10:27:26] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: 
apply [10:27:49] !log cgoubert@puppetmaster1001 conftool action : set/pooled=no:weight=10; selector: dc=eqiad,cluster=parsoid,name=parse1016.eqiad.wmnet [10:28:31] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [10:28:33] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [10:29:29] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [10:30:23] 10SRE, 10Infrastructure-Foundations, 10Observability-Alerting, 10Patch-For-Review: icinga raid monitoring inoperable for H750 controllers - https://phabricator.wikimedia.org/T315608 (10Volans) [10:31:06] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:829020|Enable sitelinks to redirects on testwikidatawiki (T316637)]] (duration: 03m 51s) [10:31:09] T316637: Add configuration for redirect badges on production wikidatawiki - https://phabricator.wikimedia.org/T316637 [10:31:28] alright, I’m done :) [10:31:35] (03Merged) 10jenkins-bot: api-gateway: don't conditionally rewrite if asked, always do it [deployment-charts] - 10https://gerrit.wikimedia.org/r/830196 (owner: 10Hnowlan) [10:31:51] (03CR) 10Cathal Mooney: [C: 03+2] Sub-delegation of reverse DNS entries for 185.15.57.16/29 to WMCS [dns] - 10https://gerrit.wikimedia.org/r/826803 (https://phabricator.wikimedia.org/T315955) (owner: 10Cathal Mooney) [10:33:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1201 (re)pooling @ 25%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P34091 and previous config saved to /var/cache/conftool/dbconfig/20220907-103306-root.json [10:35:35] 10SRE, 10Infrastructure-Foundations, 10netops, 10Tracking-Neverending: [tracking] Don't keep on the public vlans hosts that don't require it - https://phabricator.wikimedia.org/T317177 (10ayounsi) [10:35:55] 10SRE, 10Cloud-Services, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: 
Undocumented IP on WMCS network - https://phabricator.wikimedia.org/T315955 (10cmooney) Just to confirm the change from CNAME back to A records has worked, my BIND server at home is able to resolve WMCS names again. In... [10:36:24] (03PS1) 10Filippo Giunchedi: sre: check apache_up too as part of AppserversUnreachable [alerts] - 10https://gerrit.wikimedia.org/r/830582 (https://phabricator.wikimedia.org/T314118) [10:36:26] (03PS1) 10Filippo Giunchedi: team-traffic: fix double annotation [alerts] - 10https://gerrit.wikimedia.org/r/830583 [10:36:30] !log depooled wtp1048.eqiad.wmnet from parsoid cluster T307219 [10:36:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:36:35] T307219: Put parse parse10[01-24] in production - https://phabricator.wikimedia.org/T307219 [10:37:17] (03CR) 10Filippo Giunchedi: "As per Giuseppe's feedback on task" [alerts] - 10https://gerrit.wikimedia.org/r/830582 (https://phabricator.wikimedia.org/T314118) (owner: 10Filippo Giunchedi) [10:39:23] !log cgoubert@cumin1001 START - Cookbook sre.hosts.remove-downtime for parse1016.eqiad.wmnet [10:39:23] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for parse1016.eqiad.wmnet [10:40:04] !log pooled parse1016.eqiad.wmnet (php 7.4 only) in parsoid cluster T307219 [10:40:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:04] (03CR) 10Btullis: [V: 03+1 C: 03+2] Add an entry for the cfssl-issuer service to the dse-k8s cluster [puppet] - 10https://gerrit.wikimedia.org/r/830580 (https://phabricator.wikimedia.org/T310175) (owner: 10Btullis) [10:48:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1201 (re)pooling @ 50%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P34092 and previous config saved to /var/cache/conftool/dbconfig/20220907-104811-root.json [10:48:28] !log cgoubert@puppetmaster1001 conftool action : set/pooled=no:weight=10; selector: 
dc=eqiad,cluster=parsoid,name=parse1017.eqiad.wmnet [10:49:07] 10SRE, 10serviceops-radar, 10Patch-For-Review, 10SRE Observability (FY2022/2023-Q1), 10User-fgiunchedi: Reduce IRC flood/spam during incidents - https://phabricator.wikimedia.org/T314118 (10fgiunchedi) Thank you for the feedback! >>! In T314118#8205484, @Joe wrote: > Regarding the appserver alerts, I th... [10:49:58] (03CR) 10David Caro: wmcs: changes to api service to manage toolforge replica.my.cnf (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [10:50:39] RECOVERY - memcached socket on parse1017 is OK: TCP OK - 0.000 second response time on socket /run/memcached/memcached.sock https://wikitech.wikimedia.org/wiki/Memcached [10:52:25] !log cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on wtp[1044-1046].eqiad.wmnet with reason: Downtiming replaced wtp servers [10:52:41] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on wtp[1044-1046].eqiad.wmnet with reason: Downtiming replaced wtp servers [10:53:09] (03CR) 10Jbond: [C: 03+2] raid: fix raid_mgmt_tools fact [puppet] - 10https://gerrit.wikimedia.org/r/830578 (https://phabricator.wikimedia.org/T315608) (owner: 10Jbond) [10:53:30] !log cgoubert@puppetmaster1001 conftool action : set/pooled=inactive; selector: dc=eqiad,cluster=parsoid,name=wtp1044.eqiad.wmnet [10:53:38] !log cgoubert@puppetmaster1001 conftool action : set/pooled=inactive; selector: dc=eqiad,cluster=parsoid,name=wtp1045.eqiad.wmnet [10:53:46] !log cgoubert@puppetmaster1001 conftool action : set/pooled=inactive; selector: dc=eqiad,cluster=parsoid,name=wtp1046.eqiad.wmnet [10:54:03] (03PS1) 10Giuseppe Lavagetto: Add cookbook to easily route mediawiki traffic [cookbooks] - 10https://gerrit.wikimedia.org/r/830607 (https://phabricator.wikimedia.org/T315995) [10:55:03] (03PS1) 10Hnowlan: Show deprecation warnings 
[software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/830608 [10:57:25] (03CR) 10CI reject: [V: 04-1] Add cookbook to easily route mediawiki traffic [cookbooks] - 10https://gerrit.wikimedia.org/r/830607 (https://phabricator.wikimedia.org/T315995) (owner: 10Giuseppe Lavagetto) [10:59:07] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/api-gateway: sync [10:59:26] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/api-gateway: sync [10:59:39] (03PS8) 10AOkoth: vrts: install vrts script [puppet] - 10https://gerrit.wikimedia.org/r/828673 (https://phabricator.wikimedia.org/T317059) [10:59:48] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/api-gateway: sync [11:00:16] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: sync [11:00:22] !log cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on 7 hosts with reason: Downtime pending inclusion in production [11:00:29] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on 7 hosts with reason: Downtime pending inclusion in production [11:00:55] !log btullis@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'sync'. [11:01:07] !log btullis@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'sync'. [11:01:36] (03CR) 10AOkoth: C:spamassassin Allow debugging of why service fails. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/829108 (https://phabricator.wikimedia.org/T316903) (owner: 10Slyngshede) [11:01:49] !log btullis@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'sync'. [11:01:51] !log btullis@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'sync'. 
[11:02:31] RECOVERY - mediawiki-installation DSH group on parse1017 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [11:03:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1201 (re)pooling @ 75%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P34093 and previous config saved to /var/cache/conftool/dbconfig/20220907-110316-root.json [11:04:44] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/api-gateway: sync [11:05:10] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/api-gateway: sync [11:10:12] (03PS5) 10Jbond: Initially adapt perccli to use the new raid_mgmt_tools fact [puppet] - 10https://gerrit.wikimedia.org/r/825369 (https://phabricator.wikimedia.org/T315608) (owner: 10Muehlenhoff) [11:12:38] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Upgrade core routers to Junos 21+ - https://phabricator.wikimedia.org/T295690 (10cmooney) Upgrade completed ok for cr2-eqsin and cr3-eqsin. Went straight to 21.2R3-S2.9 based on experience in ulsfo, all went ok. Used no-validate when addi... 
[11:13:24] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Upgrade core routers to Junos 21+ - https://phabricator.wikimedia.org/T295690 (10cmooney) [11:17:17] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 116 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [11:18:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1201 (re)pooling @ 100%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P34094 and previous config saved to /var/cache/conftool/dbconfig/20220907-111821-root.json [11:22:37] PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [11:25:43] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.4 point update - https://phabricator.wikimedia.org/T312637 (10MoritzMuehlenhoff) [11:26:21] 10SRE, 10Infrastructure-Foundations, 10netops, 10Tracking-Neverending: [tracking] Don't keep on the public vlans hosts that don't require it - https://phabricator.wikimedia.org/T317177 (10ayounsi) [11:28:36] (03CR) 10Vgutierrez: [C: 03+1] team-traffic: fix double annotation [alerts] - 10https://gerrit.wikimedia.org/r/830583 (owner: 10Filippo Giunchedi) [11:30:01] RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [11:33:09] (03CR) 10Jbond: [C: 03+2] realm.pp: Add defaults for file [puppet] - 10https://gerrit.wikimedia.org/r/809095 (owner: 10Jbond) [11:34:54] !log change default puppet file permissions ro root:root [11:34:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:27] (03PS1) 10Jbond: C:phabricator: fix user permissions [puppet] - 10https://gerrit.wikimedia.org/r/830611 [11:40:25] (03PS1) 10Jbond: C:ferm: set group perms to adm [puppet] - 10https://gerrit.wikimedia.org/r/830612 [11:40:39] (03CR) 10Jbond: [V: 03+2 C: 03+2] C:ferm: set group perms to adm [puppet] - 10https://gerrit.wikimedia.org/r/830612 (owner: 10Jbond) [11:41:04] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37146/console" [puppet] - 10https://gerrit.wikimedia.org/r/830611 (owner: 10Jbond) [11:41:23] (03CR) 10Jbond: [V: 03+1 C: 03+2] C:phabricator: fix user permissions [puppet] - 10https://gerrit.wikimedia.org/r/830611 (owner: 10Jbond) [11:41:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2120 (re)pooling @ 5%: Pooling after cloning another host', diff saved to https://phabricator.wikimedia.org/P34095 and previous config saved to /var/cache/conftool/dbconfig/20220907-114142-root.json [11:41:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2122 (re)pooling @ 1%: Pooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34096 and previous config saved to /var/cache/conftool/dbconfig/20220907-114154-root.json [11:42:03] (03CR) 10Jbond: [V: 03+2 C: 03+2] C:phabricator: fix user permissions [puppet] - 10https://gerrit.wikimedia.org/r/830611 (owner: 10Jbond) [11:42:16] (03PS1) 10Marostegui: Revert "db2122: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/830591 [11:44:26] (03CR) 10Marostegui: [C: 03+2] Revert "db2122: 
Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/830591 (owner: 10Marostegui) [11:48:09] PROBLEM - cinder-scheduler process on cloudcontrol1005 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python.* /usr/bin/cinder-scheduler https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [11:48:29] PROBLEM - cinder-api http on cloudcontrol1005 is CRITICAL: connect to address 208.80.154.85 and port 18776: Connection refused https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [11:48:39] PROBLEM - cinder-volume process on cloudcontrol1005 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python.* /usr/bin/cinder-volume https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [11:49:11] PROBLEM - cinder-api http on cloudcontrol1007 is CRITICAL: connect to address 208.80.155.104 and port 18776: Connection refused https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [11:49:53] PROBLEM - cinder-scheduler process on cloudcontrol1007 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python.* /usr/bin/cinder-scheduler https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [11:50:13] PROBLEM - cinder-volume process on cloudcontrol1007 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python.* /usr/bin/cinder-volume https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [11:52:23] (03PS1) 10Jbond: C:prometheus::ipmi_exporter: update permissions to use prometheus user [puppet] - 10https://gerrit.wikimedia.org/r/830616 [11:53:00] (03CR) 10Jbond: [V: 03+2 C: 03+2] C:prometheus::ipmi_exporter: update permissions to use prometheus user [puppet] - 10https://gerrit.wikimedia.org/r/830616 (owner: 10Jbond) [11:56:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2120 (re)pooling @ 10%: Pooling after cloning another host', diff saved to https://phabricator.wikimedia.org/P34097 and 
previous config saved to /var/cache/conftool/dbconfig/20220907-115647-root.json [11:56:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2122 (re)pooling @ 5%: Pooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34098 and previous config saved to /var/cache/conftool/dbconfig/20220907-115659-root.json [11:57:15] PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [11:58:03] (03PS1) 10Jbond: C:query_service::/deploy::scap: correct username post [puppet] - 10https://gerrit.wikimedia.org/r/830618 [11:58:22] (03CR) 10Jbond: [V: 03+2 C: 03+2] C:query_service::/deploy::scap: correct username post [puppet] - 10https://gerrit.wikimedia.org/r/830618 (owner: 10Jbond) [12:02:11] RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [12:05:49] (03PS1) 10Jbond: C:ceph: ensure that the ceph keyring folder gets the correct owner/group [puppet] - 10https://gerrit.wikimedia.org/r/830619 [12:05:57] PROBLEM - Check systemd state on cloudbackup2002 is CRITICAL: CRITICAL - degraded: The following units failed: cinder-backup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:06:15] PROBLEM - cinder-api http on cloudcontrol1006 is CRITICAL: connect to address 208.80.154.149 and port 18776: Connection refused https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [12:06:18] (03CR) 10Jbond: [V: 03+2 C: 03+2] C:ceph: ensure that the ceph keyring folder gets the correct owner/group [puppet] - 10https://gerrit.wikimedia.org/r/830619 (owner: 10Jbond) [12:06:29] PROBLEM - cinder-volume process on cloudcontrol1006 is CRITICAL: PROCS CRITICAL: 0 processes with regex args 
^/usr/bin/python.* /usr/bin/cinder-volume https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [12:06:35] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.02847 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [12:06:43] * jbond lookinh [12:07:27] PROBLEM - cinder-scheduler process on cloudcontrol1006 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python.* /usr/bin/cinder-scheduler https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [12:08:15] !log disable puppet fleet wide to fix issues [12:08:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:53] PROBLEM - puppetmaster backend https on puppetmaster2001 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 8141: HTTP/1.1 500 Internal Server Error https://wikitech.wikimedia.org/wiki/Puppet%23Debugging [12:11:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2120 (re)pooling @ 25%: Pooling after cloning another host', diff saved to https://phabricator.wikimedia.org/P34099 and previous config saved to /var/cache/conftool/dbconfig/20220907-121152-root.json [12:12:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2122 (re)pooling @ 10%: Pooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34100 and previous config saved to /var/cache/conftool/dbconfig/20220907-121204-root.json [12:13:38] 10SRE-swift-storage: Swiftrepl doesn't work on bullseye (and swiftrepl.conf is deployed by hand) - https://phabricator.wikimedia.org/T299125 (10MoritzMuehlenhoff) This is blocking the migration of the two remaining Swift frontends away from Stretch. It seems an alternative replication via rclone will replace it,... 
[12:16:36] PROBLEM - cinder-volume process on cloudcontrol1006 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python.* /usr/bin/cinder-volume https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [12:17:44] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.09411 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [12:20:02] PROBLEM - Widespread puppet agent failures- no resources reported on alert1001 is CRITICAL: 0.01255 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [12:22:38] RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [12:26:10] (03PS1) 10Jbond: P:puppetmaster: ensure dire is readable by puppet [puppet] - 10https://gerrit.wikimedia.org/r/830622 [12:26:24] (03CR) 10Jbond: [V: 03+2 C: 03+2] P:puppetmaster: ensure dire is readable by puppet [puppet] - 10https://gerrit.wikimedia.org/r/830622 (owner: 10Jbond) [12:26:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2120 (re)pooling @ 50%: Pooling after cloning another host', diff saved to https://phabricator.wikimedia.org/P34101 and previous config saved to /var/cache/conftool/dbconfig/20220907-122656-root.json [12:27:05] (03PS1) 10Filippo Giunchedi: icinga: fix dir group ownership [puppet] - 10https://gerrit.wikimedia.org/r/830623 [12:27:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2122 (re)pooling @ 25%: Pooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34102 and previous config saved to /var/cache/conftool/dbconfig/20220907-122708-root.json [12:27:17] jbond: ^ [12:27:20] !log installing runc security updates on codfw staging hosts [12:27:21] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:38] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/830623 (owner: 10Filippo Giunchedi) [12:27:41] goithx +1 [12:27:54] PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [12:27:58] godog: thx +1 [12:28:08] (03PS1) 10JMeybohm: Alert in high Kubernetes API error rate [alerts] - 10https://gerrit.wikimedia.org/r/830624 (https://phabricator.wikimedia.org/T311251) [12:28:09] np! waiting for CI and then merging [12:28:15] (03CR) 10Filippo Giunchedi: [C: 03+2] team-traffic: fix double annotation [alerts] - 10https://gerrit.wikimedia.org/r/830583 (owner: 10Filippo Giunchedi) [12:28:21] (03PS2) 10Filippo Giunchedi: team-traffic: fix double annotation [alerts] - 10https://gerrit.wikimedia.org/r/830583 [12:28:27] (03CR) 10Filippo Giunchedi: [C: 03+2] icinga: fix dir group ownership [puppet] - 10https://gerrit.wikimedia.org/r/830623 (owner: 10Filippo Giunchedi) [12:29:19] (03CR) 10Filippo Giunchedi: [V: 03+2] team-traffic: fix double annotation [alerts] - 10https://gerrit.wikimedia.org/r/830583 (owner: 10Filippo Giunchedi) [12:29:57] (03PS6) 10Muehlenhoff: Initially adapt perccli to use the new raid_mgmt_tools fact [puppet] - 10https://gerrit.wikimedia.org/r/825369 (https://phabricator.wikimedia.org/T315608) [12:31:52] !log re-enable puppet [12:31:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:54] (03PS2) 10JMeybohm: Alert on high Kubernetes API error rate [alerts] - 10https://gerrit.wikimedia.org/r/830624 (https://phabricator.wikimedia.org/T311251) [12:37:40] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - 
https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [12:40:14] RECOVERY - Widespread puppet agent failures- no resources reported on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.003861 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [12:42:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2120 (re)pooling @ 75%: Pooling after cloning another host', diff saved to https://phabricator.wikimedia.org/P34103 and previous config saved to /var/cache/conftool/dbconfig/20220907-124201-root.json [12:42:09] (03PS1) 10Filippo Giunchedi: icinga: fix icinga.log perms [puppet] - 10https://gerrit.wikimedia.org/r/830626 [12:42:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2122 (re)pooling @ 50%: Pooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34104 and previous config saved to /var/cache/conftool/dbconfig/20220907-124213-root.json [12:43:34] PROBLEM - cinder-api http on cloudcontrol1006 is CRITICAL: connect to address 208.80.154.149 and port 18776: Connection refused https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [12:44:06] (03CR) 10Filippo Giunchedi: [C: 03+2] icinga: fix icinga.log perms [puppet] - 10https://gerrit.wikimedia.org/r/830626 (owner: 10Filippo Giunchedi) [12:44:54] RECOVERY - Widespread puppet agent failures on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.003378 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [12:46:36] PROBLEM - cinder-scheduler process on cloudcontrol1006 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python.* /usr/bin/cinder-scheduler https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [12:46:58] RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [12:49:10] (03PS1) 10Jbond: cinder: fix file perms [puppet] - 10https://gerrit.wikimedia.org/r/830628 [12:51:42] PROBLEM - Check systemd state on cloudbackup2002 is CRITICAL: CRITICAL - degraded: The following units failed: cinder-backup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:54:30] (03PS1) 10Jbond: Revert "C:ceph: ensure that the ceph keyring folder gets the correct owner/group" [puppet] - 10https://gerrit.wikimedia.org/r/830592 [12:55:24] (03CR) 10David Caro: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/830592 (owner: 10Jbond) [12:55:29] (03CR) 10Jbond: [C: 03+2] cinder: fix file perms [puppet] - 10https://gerrit.wikimedia.org/r/830628 (owner: 10Jbond) [12:55:45] (03CR) 10Jbond: [C: 03+2] Revert "C:ceph: ensure that the ceph keyring folder gets the correct owner/group" [puppet] - 10https://gerrit.wikimedia.org/r/830592 (owner: 10Jbond) [12:57:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2120 (re)pooling @ 100%: Pooling after cloning another host', diff saved to https://phabricator.wikimedia.org/P34105 and previous config saved to /var/cache/conftool/dbconfig/20220907-125706-root.json [12:57:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2122 (re)pooling @ 75%: Pooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34106 and previous config saved to /var/cache/conftool/dbconfig/20220907-125718-root.json [12:58:08] RECOVERY - cinder-scheduler process on cloudcontrol1005 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python.* /usr/bin/cinder-scheduler https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [12:59:02] (03PS1) 10Clément Goubert: deployment-prep: Add C:mediawiki::packages::beta [puppet] - 10https://gerrit.wikimedia.org/r/830629 (https://phabricator.wikimedia.org/T317128) [12:59:41] (03PS2) 10Clément Goubert: 
deployment-prep: Add P:beta::mediawiki_packages [puppet] - 10https://gerrit.wikimedia.org/r/830629 (https://phabricator.wikimedia.org/T317128) [12:59:58] RECOVERY - cinder-api http on cloudcontrol1005 is OK: HTTP OK: HTTP/1.1 300 Multiple Choices - 535 bytes in 0.005 second response time https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:00:02] RECOVERY - cinder-volume process on cloudcontrol1005 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python.* /usr/bin/cinder-volume https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, and awight: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220907T1300). [13:00:05] sergi0, TheresNoTime, and TheresNoTime: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:17] * TheresNoTime is here [13:00:28] RECOVERY - Check systemd state on cloudbackup2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:00:29] hello! 
[13:01:17] I’m about to leave and can’t deploy, sorry [13:01:36] RECOVERY - cinder-scheduler process on cloudcontrol1006 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python.* /usr/bin/cinder-scheduler https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:02:17] I can deploy, but will give urbanecm a few more minutes :) [13:02:22] RECOVERY - cinder-api http on cloudcontrol1006 is OK: HTTP OK: HTTP/1.1 300 Multiple Choices - 536 bytes in 0.004 second response time https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:02:24] RECOVERY - cinder-volume process on cloudcontrol1007 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python.* /usr/bin/cinder-volume https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:03:02] RECOVERY - cinder-volume process on cloudcontrol1006 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python.* /usr/bin/cinder-volume https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:03:20] sounds good! 
[13:03:37] (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37147/console" [puppet] - 10https://gerrit.wikimedia.org/r/830629 (https://phabricator.wikimedia.org/T317128) (owner: 10Clément Goubert) [13:03:42] (03CR) 10CI reject: [V: 04-1] deployment-prep: Add P:beta::mediawiki_packages [puppet] - 10https://gerrit.wikimedia.org/r/830629 (https://phabricator.wikimedia.org/T317128) (owner: 10Clément Goubert) [13:03:50] RECOVERY - cinder-api http on cloudcontrol1007 is OK: HTTP OK: HTTP/1.1 300 Multiple Choices - 536 bytes in 0.006 second response time https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:05:23] (03PS3) 10Clément Goubert: deployment-prep: Add P:beta::mediawiki_packages [puppet] - 10https://gerrit.wikimedia.org/r/830629 (https://phabricator.wikimedia.org/T317128) [13:05:37] urbanecm: I am going to deploy :) [13:06:08] RECOVERY - cinder-scheduler process on cloudcontrol1007 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python.* /usr/bin/cinder-scheduler https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:06:10] TheresNoTime: thank you :) [13:06:15] (03PS4) 10Clément Goubert: deployment-prep: Add P:beta::mediawiki_packages [puppet] - 10https://gerrit.wikimedia.org/r/830629 (https://phabricator.wikimedia.org/T317128) [13:06:32] (03CR) 10Samtar: [C: 03+2] Mentee overview(vue): prevent clicks on more recent edit buttons to submit the filters [extensions/GrowthExperiments] (wmf/1.39.0-wmf.27) - 10https://gerrit.wikimedia.org/r/830199 (https://phabricator.wikimedia.org/T316926) (owner: 10Sergio Gimeno) [13:08:18] (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37148/console" [puppet] - 10https://gerrit.wikimedia.org/r/830629 (https://phabricator.wikimedia.org/T317128) (owner: 10Clément Goubert) [13:10:59] sergi0: just waiting on your 
patch to merge (cc sergi0_) [13:12:19] (03PS2) 10Phuedx: beta: $wgIPInfoGeoIP2Prefix -> $wgIPInfoGeoLite2Prefix [mediawiki-config] - 10https://gerrit.wikimedia.org/r/828482 [13:12:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2122 (re)pooling @ 100%: Pooling after upgrade', diff saved to https://phabricator.wikimedia.org/P34107 and previous config saved to /var/cache/conftool/dbconfig/20220907-131223-root.json [13:12:49] TheresNoTime: alright, ty. The change is a no-op in production wikis. [13:13:14] TheresNoTime: I have an unstable connection in case you see me dropping :/ [13:13:48] no worries :) [13:20:19] (you might find something like https://www.irccloud.com useful sergi0) [13:23:18] TheresNoTime: Thank you, I will try it, my apologies I'm a terribly clumsy with irc :/ [13:23:52] No apologies necessary :) you'll probably find IRCCloud a little easier too then [13:23:59] almost merged your patch [13:25:01] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Almost ready! 2 inline comments and it will be ready as far as I am concerned." 
[puppet] - 10https://gerrit.wikimedia.org/r/828673 (https://phabricator.wikimedia.org/T317059) (owner: 10AOkoth) [13:30:02] (03PS1) 10Btullis: Add the configuration to create LVM volumes for dse-k8s monitoring [puppet] - 10https://gerrit.wikimedia.org/r/830633 (https://phabricator.wikimedia.org/T310179) [13:30:22] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/830633 (https://phabricator.wikimedia.org/T310179) (owner: 10Btullis) [13:30:24] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): hw troubleshooting: power supply alert for cloudcephosd1031.eqiad.wmnet - https://phabricator.wikimedia.org/T317127 (10Jclark-ctr) 05Open→03Resolved a:05Cmjohnson→03Jclark-ctr Reseated power cable [13:31:45] (03Merged) 10jenkins-bot: Mentee overview(vue): prevent clicks on more recent edit buttons to submit the filters [extensions/GrowthExperiments] (wmf/1.39.0-wmf.27) - 10https://gerrit.wikimedia.org/r/830199 (https://phabricator.wikimedia.org/T316926) (owner: 10Sergio Gimeno) [13:33:27] sergi0: am I correct that this is a no-op/no test change? [13:34:15] (ah yes you said as much earlier) [13:34:26] TheresNoTime: it is. The code branch is unreachable. 
[13:34:39] syncing now :) [13:35:38] (03PS6) 10Samtar: private/readme.php: Add $wgPhonosApiKeyGoogle [mediawiki-config] - 10https://gerrit.wikimedia.org/r/825762 (https://phabricator.wikimedia.org/T315491) [13:35:43] (03PS2) 10Samtar: CommonSettings-labs: Set config to production-esque values [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830567 (https://phabricator.wikimedia.org/T314294) [13:36:29] !log akosiaris@cumin1001 START - Cookbook sre.hosts.reimage for host rdb1009.eqiad.wmnet with OS bullseye [13:36:46] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:37:47] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:37:49] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:38:46] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:38:49] !log samtar@deploy1002 Synchronized php-1.39.0-wmf.27/extensions/GrowthExperiments/modules/ext.growthExperiments.MentorDashboard.Vue/components/MenteeOverview/MenteeFiltersForm.vue: Backport: [[gerrit:830199|Mentee overview(vue): prevent clicks on more recent edit buttons to submit the filters (T316926)]] (duration: 04m 07s) [13:38:52] T316926: Click on a most recent edit value submits all filters - https://phabricator.wikimedia.org/T316926 [13:40:29] (03CR) 10TrainBranchBot: [C: 03+2] "Approved via scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/825762 (https://phabricator.wikimedia.org/T315491) (owner: 10Samtar) [13:40:39] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): hw troubleshooting: one disk not working properly in cloudcephosd1030.eqiad.wmnet - https://phabricator.wikimedia.org/T316673 (10Jclark-ctr) Confirmed: Service Request 151038642 was successfully submitted. 
Created dell ticket [13:41:34] (03Merged) 10jenkins-bot: private/readme.php: Add $wgPhonosApiKeyGoogle [mediawiki-config] - 10https://gerrit.wikimedia.org/r/825762 (https://phabricator.wikimedia.org/T315491) (owner: 10Samtar) [13:42:07] !log samtar@deploy1002 Started scap: Backport for [[gerrit:825762|private/readme.php: Add $wgPhonosApiKeyGoogle (T315491)]] [13:42:10] T315491: Add $wgPhonosApiKeyGoogle to PrivateSettings - https://phabricator.wikimedia.org/T315491 [13:42:35] !log samtar@deploy1002 samtar and samtar: Backport for [[gerrit:825762|private/readme.php: Add $wgPhonosApiKeyGoogle (T315491)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet [13:42:44] TheresNoTime: thank you for deploying and the tips :) [13:42:55] sergi0: no worried :) [13:43:55] *s [13:44:14] (03PS3) 10Samtar: CommonSettings-labs: Set config to production-esque values [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830567 (https://phabricator.wikimedia.org/T314294) [13:46:58] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:825762|private/readme.php: Add $wgPhonosApiKeyGoogle (T315491)]] (duration: 04m 51s) [13:47:15] (03CR) 10Samtar: [C: 03+2] "deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830567 (https://phabricator.wikimedia.org/T314294) (owner: 10Samtar) [13:48:05] (03Merged) 10jenkins-bot: CommonSettings-labs: Set config to production-esque values [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830567 (https://phabricator.wikimedia.org/T314294) (owner: 10Samtar) [13:48:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:49:03] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on rdb1009.eqiad.wmnet with reason: host reimage [13:50:16] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): hw troubleshooting: one disk not working properly in cloudcephosd1030.eqiad.wmnet - 
https://phabricator.wikimedia.org/T316673 (10fnegri) Thanks @Jclark-ctr -- FYI I have temporarily removed this node from the Ceph cluster, so it can be safely re... [13:50:45] (03CR) 10Filippo Giunchedi: Add the configuration to create LVM volumes for dse-k8s monitoring (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/830633 (https://phabricator.wikimedia.org/T310179) (owner: 10Btullis) [13:51:29] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): hw troubleshooting: one disk not working properly in cloudcephosd1030.eqiad.wmnet - https://phabricator.wikimedia.org/T316673 (10Jclark-ctr) @fnegri i did rerun support log and it did not show any errors that i noticed i still opened support ti... [13:52:01] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on rdb1009.eqiad.wmnet with reason: host reimage [13:53:04] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:53:05] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:53:58] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:54:34] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2140.codfw.wmnet with reason: Maint [13:54:37] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2140.codfw.wmnet with reason: Maint [13:55:57] !log samtar@deploy1002 Synchronized wmf-config/CommonSettings-labs.php: Config: [[gerrit:830567|CommonSettings-labs: Set config to production-esque values (T314294)]] (duration: 03m 47s) [13:55:59] T314294: Deploy Phonos to beta cluster - https://phabricator.wikimedia.org/T314294 [13:56:31] !log UTC afternoon backport window closed [13:56:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:11] !log btullis@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
[13:57:11] (03CR) 10Ssingh: [C: 03+1] "For dns2001, PCC output:" [puppet] - 10https://gerrit.wikimedia.org/r/677872 (https://phabricator.wikimedia.org/T265864) (owner: 10Ayounsi) [13:57:15] !log btullis@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [13:58:09] `scap backport` is very nice :) [13:59:04] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [14:00:08] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [14:00:09] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [14:01:06] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [14:02:42] !log depooled wtp1027.eqiad.wmnet from parsoid cluster T307219 [14:02:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:46] T307219: Put parse parse10[01-24] in production - https://phabricator.wikimedia.org/T307219 [14:04:27] (03CR) 10Samtar: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/830629 (https://phabricator.wikimedia.org/T317128) (owner: 10Clément Goubert) [14:05:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1118 (re)pooling @ 5%: Pooling after maintenance', diff saved to https://phabricator.wikimedia.org/P34108 and previous config saved to /var/cache/conftool/dbconfig/20220907-140503-root.json [14:07:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1173 (re)pooling @ 1%: Pooling after maintenance', diff saved to https://phabricator.wikimedia.org/P34109 and previous config saved to /var/cache/conftool/dbconfig/20220907-140758-root.json [14:08:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T312863)', diff saved to https://phabricator.wikimedia.org/P34110 and previous config saved to /var/cache/conftool/dbconfig/20220907-140813-ladsgroup.json [14:08:17] T312863: Schema change to change primary key of templatelinks - 
https://phabricator.wikimedia.org/T312863 [14:08:20] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host rdb1009.eqiad.wmnet with OS bullseye [14:10:39] (03PS4) 10Eevans: cassandra: Create new role for testing AQS bulk-loader changes [puppet] - 10https://gerrit.wikimedia.org/r/830268 (https://phabricator.wikimedia.org/T317140) [14:11:44] (03PS5) 10Eevans: cassandra: Create new role for testing AQS bulk-loader changes [puppet] - 10https://gerrit.wikimedia.org/r/830268 (https://phabricator.wikimedia.org/T317140) [14:12:43] (03PS2) 10Reedy: wikimaniawiki: create 2023 namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830248 (https://phabricator.wikimedia.org/T316928) (owner: 10Robertsky) [14:12:55] (03CR) 10Reedy: [C: 03+2] wikimaniawiki: create 2023 namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830248 (https://phabricator.wikimedia.org/T316928) (owner: 10Robertsky) [14:13:14] (03PS2) 10Reedy: wikimaniawiki: update default searched namespace for Wikimania 2023 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830249 (https://phabricator.wikimedia.org/T316928) (owner: 10Robertsky) [14:13:18] (03CR) 10Reedy: [C: 03+2] wikimaniawiki: update default searched namespace for Wikimania 2023 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830249 (https://phabricator.wikimedia.org/T316928) (owner: 10Robertsky) [14:13:22] 10SRE-OnFire, 10DBA, 10Sustainability (Incident Followup): Investigate mariadb 10.6 performance regression during spikes/high load - https://phabricator.wikimedia.org/T311106 (10Marostegui) Finally installed the new package also on db1127 (s7) and will repool it tomorrow. 
[14:13:29] (03PS2) 10Reedy: wikimaniawiki: enable Visual Editor on 2023 namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830250 (https://phabricator.wikimedia.org/T316928) (owner: 10Robertsky) [14:13:33] (03CR) 10Reedy: [C: 03+2] wikimaniawiki: enable Visual Editor on 2023 namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830250 (https://phabricator.wikimedia.org/T316928) (owner: 10Robertsky) [14:13:58] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/830268 (https://phabricator.wikimedia.org/T317140) (owner: 10Eevans) [14:14:05] (03Merged) 10jenkins-bot: wikimaniawiki: create 2023 namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830248 (https://phabricator.wikimedia.org/T316928) (owner: 10Robertsky) [14:14:10] (03Merged) 10jenkins-bot: wikimaniawiki: update default searched namespace for Wikimania 2023 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830249 (https://phabricator.wikimedia.org/T316928) (owner: 10Robertsky) [14:14:26] (03Merged) 10jenkins-bot: wikimaniawiki: enable Visual Editor on 2023 namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830250 (https://phabricator.wikimedia.org/T316928) (owner: 10Robertsky) [14:16:24] PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [14:17:34] jouncebot: now [14:17:34] No deployments scheduled for the next 3 hour(s) and 42 minute(s) [14:20:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1118 (re)pooling @ 10%: Pooling after maintenance', diff saved to https://phabricator.wikimedia.org/P34111 and previous config saved to /var/cache/conftool/dbconfig/20220907-142008-root.json [14:20:21] (03CR) 10Clément Goubert: [C: 03+2] scap/cumin: switch parsoid eqiad canaries [puppet] - 10https://gerrit.wikimedia.org/r/830193 
(https://phabricator.wikimedia.org/T307219) (owner: 10Clément Goubert) [14:20:25] (03CR) 10Hnowlan: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/830268 (https://phabricator.wikimedia.org/T317140) (owner: 10Eevans) [14:20:38] !log Switching canaries for parsoid eqiad T307219 [14:20:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:40] T307219: Put parse parse10[01-24] in production - https://phabricator.wikimedia.org/T307219 [14:21:18] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [14:22:19] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [14:22:20] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [14:23:02] RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [14:23:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1173 (re)pooling @ 5%: Pooling after maintenance', diff saved to https://phabricator.wikimedia.org/P34112 and previous config saved to /var/cache/conftool/dbconfig/20220907-142303-root.json [14:23:08] !log akosiaris@cumin1001 START - Cookbook sre.hosts.reimage for host rdb1010.eqiad.wmnet with OS bullseye [14:23:15] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [14:23:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P34113 and previous config saved to /var/cache/conftool/dbconfig/20220907-142321-ladsgroup.json [14:23:42] !log cgoubert@puppetmaster1001 conftool action : set/weight=1; selector: dc=eqiad,cluster=parsoid,service=canary [14:23:44] (03PS3) 10JMeybohm: Alert on high Kubernetes API error rate [alerts] - 10https://gerrit.wikimedia.org/r/830624 
(https://phabricator.wikimedia.org/T311251) [14:23:46] (03PS1) 10JMeybohm: Alert on high Kubernetes API latency [alerts] - 10https://gerrit.wikimedia.org/r/830637 (https://phabricator.wikimedia.org/T311251) [14:23:49] (03PS2) 10CDanis: WIP: Basic seeding of an oncall handoff message [software/klaxon] - 10https://gerrit.wikimedia.org/r/830259 (https://phabricator.wikimedia.org/T317159) [14:24:04] (03CR) 10Eevans: [C: 03+2] cassandra: Create new role for testing AQS bulk-loader changes [puppet] - 10https://gerrit.wikimedia.org/r/830268 (https://phabricator.wikimedia.org/T317140) (owner: 10Eevans) [14:27:56] 10SRE-OnFire, 10DBA, 10Sustainability (Incident Followup): Investigate mariadb 10.6 performance regression during spikes/high load - https://phabricator.wikimedia.org/T311106 (10Marostegui) [14:29:36] !log installing runc security updates on k8s servers [14:29:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:23] (03PS1) 10Jbond: C:prometheus: fix permissions for prometheus directory [puppet] - 10https://gerrit.wikimedia.org/r/830640 [14:32:36] !log parsoid eqiad canaries switched to parse1001 and parse1002 T307219 [14:32:38] (03CR) 10Alexandros Kosiaris: [C: 03+1] Alert on high lateny of kubelet operations [alerts] - 10https://gerrit.wikimedia.org/r/830228 (https://phabricator.wikimedia.org/T311251) (owner: 10JMeybohm) [14:32:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:40] T307219: Put parse parse10[01-24] in production - https://phabricator.wikimedia.org/T307219 [14:35:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1118 (re)pooling @ 25%: Pooling after maintenance', diff saved to https://phabricator.wikimedia.org/P34114 and previous config saved to /var/cache/conftool/dbconfig/20220907-143513-root.json [14:35:14] !log jbond@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "sync data - jbond@cumin1001" [14:35:50] !log cgoubert@cumin1001 START 
- Cookbook sre.hosts.remove-downtime for parse1017.eqiad.wmnet [14:35:50] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for parse1017.eqiad.wmnet [14:35:57] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on rdb1010.eqiad.wmnet with reason: host reimage [14:36:10] !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "sync data - jbond@cumin1001" [14:37:16] !log pooled parse1017.eqiad.wmnet (php 7.4 only) in parsoid cluster T307219 [14:37:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1173 (re)pooling @ 10%: Pooling after maintenance', diff saved to https://phabricator.wikimedia.org/P34115 and previous config saved to /var/cache/conftool/dbconfig/20220907-143808-root.json [14:38:26] 10SRE, 10Traffic: Remove old and unused libvarnishapi - https://phabricator.wikimedia.org/T300247 (10Aklapper) a:05MMandere→03None Removing inactive assignee (please do so as part of offboarding - thanks!) [14:38:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P34116 and previous config saved to /var/cache/conftool/dbconfig/20220907-143828-ladsgroup.json [14:38:33] 10SRE, 10Traffic: Clean up Traffic Grafana dashboards to reflect HA-Proxy metrics - https://phabricator.wikimedia.org/T304153 (10Aklapper) a:05MMandere→03None Removing inactive assignee (please do so as part of offboarding - thanks!) 
[14:39:07] (03CR) 10Alexandros Kosiaris: [C: 04-1] Alert on high Kubernetes API error rate (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/830624 (https://phabricator.wikimedia.org/T311251) (owner: 10JMeybohm) [14:39:19] 10SRE, 10Traffic, 10Patch-For-Review: Create program to interact with Atlas RIPE API - https://phabricator.wikimedia.org/T315536 (10Aklapper) a:05MMandere→03None Removing inactive assignee (please do so as part of offboarding - thanks!) [14:39:34] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on rdb1010.eqiad.wmnet with reason: host reimage [14:40:39] (03CR) 10Ori: [C: 03+1] Sort query parameters in URLs (031 comment) [software/purged] - 10https://gerrit.wikimedia.org/r/830124 (https://phabricator.wikimedia.org/T317064) (owner: 10Giuseppe Lavagetto) [14:40:55] ^ vgutierrez [14:41:08] (is there a reason to suspect it will be a bottleneck?) [14:41:37] !log cgoubert@puppetmaster1001 conftool action : set/pooled=no:weight=10; selector: dc=eqiad,cluster=parsoid,name=parse1018.eqiad.wmnet [14:41:57] (03CR) 10Alexandros Kosiaris: [C: 03+1] Alert on high Kubernetes API latency [alerts] - 10https://gerrit.wikimedia.org/r/830637 (https://phabricator.wikimedia.org/T311251) (owner: 10JMeybohm) [14:42:51] 10SRE, 10Wikimedia-Mailing-lists: Archive coolest-tool-academy mailing list - https://phabricator.wikimedia.org/T317185 (10Legoktm) Hi @Aklapper, can you explain why the Academy is moving from free software tool to a proprietary one? I know that the Coolest Tool Award's use of proprietary software (Google Form... 
[14:44:12] (03PS1) 10Zabe: beta: Remove deployment-parsoid11 from wgLinterSubmitterWhitelist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830641 [14:44:25] (03CR) 10JMeybohm: Alert on high Kubernetes API error rate (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/830624 (https://phabricator.wikimedia.org/T311251) (owner: 10JMeybohm) [14:44:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P34117 and previous config saved to /var/cache/conftool/dbconfig/20220907-144434-ladsgroup.json [14:44:36] (03PS1) 10Jbond: P:ci::slave: add group to homedir [puppet] - 10https://gerrit.wikimedia.org/r/830642 [14:45:55] (03CR) 10Giuseppe Lavagetto: Sort query parameters in URLs (031 comment) [software/purged] - 10https://gerrit.wikimedia.org/r/830124 (https://phabricator.wikimedia.org/T317064) (owner: 10Giuseppe Lavagetto) [14:46:08] (03PS1) 10JMeybohm: kuberneste: Remove obsolete monitoring::check_prometheus resources [puppet] - 10https://gerrit.wikimedia.org/r/830643 (https://phabricator.wikimedia.org/T311251) [14:46:10] (03PS1) 10JMeybohm: prometheus: Remove obsolete recording rules [puppet] - 10https://gerrit.wikimedia.org/r/830644 (https://phabricator.wikimedia.org/T311251) [14:46:38] :D kuberneste is nice [14:47:17] (03PS1) 10Muehlenhoff: Fix entries in raid_mgmt_tools fact [puppet] - 10https://gerrit.wikimedia.org/r/830645 (https://phabricator.wikimedia.org/T315608) [14:47:24] (03PS1) 10Andrew Bogott: Rename 'Galera haproxy failover' alert [puppet] - 10https://gerrit.wikimedia.org/r/830646 [14:47:45] (03CR) 10Vgutierrez: [C: 03+1] Sort query parameters in URLs [software/purged] - 10https://gerrit.wikimedia.org/r/830124 (https://phabricator.wikimedia.org/T317064) (owner: 10Giuseppe Lavagetto) [14:48:35] !log depooled wtp1025.eqiad.wmnet from parsoid cluster T307219 [14:48:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:38] T307219: 
Put parse parse10[01-24] in production - https://phabricator.wikimedia.org/T307219 [14:48:38] (03PS2) 10JMeybohm: kubernetes: Remove obsolete monitoring::check_prometheus resources [puppet] - 10https://gerrit.wikimedia.org/r/830643 (https://phabricator.wikimedia.org/T311251) [14:48:40] (03PS2) 10JMeybohm: prometheus: Remove obsolete recording rules [puppet] - 10https://gerrit.wikimedia.org/r/830644 (https://phabricator.wikimedia.org/T311251) [14:48:47] (03PS2) 10Muehlenhoff: Fix entries in raid_mgmt_tools fact [puppet] - 10https://gerrit.wikimedia.org/r/830645 (https://phabricator.wikimedia.org/T315608) [14:50:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1118 (re)pooling @ 50%: Pooling after maintenance', diff saved to https://phabricator.wikimedia.org/P34118 and previous config saved to /var/cache/conftool/dbconfig/20220907-145018-root.json [14:50:50] 10SRE, 10Infrastructure-Foundations, 10netops, 10Epic: [tracking] Don't keep on the public vlans hosts that don't require it - https://phabricator.wikimedia.org/T317177 (10Aklapper) Assuming this task is not literally neverending (if it was, it should be a project tag instead) [14:50:52] !log cgoubert@cumin1001 START - Cookbook sre.hosts.remove-downtime for parse1018.eqiad.wmnet [14:50:53] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for parse1018.eqiad.wmnet [14:51:23] (03PS1) 10Jbond: P:analytics::refinery::job::gobblin_job: fix permissions [puppet] - 10https://gerrit.wikimedia.org/r/830647 [14:52:38] !log pooled parse1018.eqiad.wmnet (php 7.4 only) in parsoid cluster T307219 [14:52:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1173 (re)pooling @ 25%: Pooling after maintenance', diff saved to https://phabricator.wikimedia.org/P34119 and previous config saved to /var/cache/conftool/dbconfig/20220907-145313-root.json [14:54:46] !log akosiaris@cumin1001 END (PASS) - 
Cookbook sre.hosts.reimage (exit_code=0) for host rdb1010.eqiad.wmnet with OS bullseye [14:56:39] !log reedy@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Wikimania 2023 setup T316928 (duration: 04m 04s) [14:56:42] T316928: Wikimania wiki preparations for 2023 - https://phabricator.wikimedia.org/T316928 [14:57:33] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] Sort query parameters in URLs [software/purged] - 10https://gerrit.wikimedia.org/r/830124 (https://phabricator.wikimedia.org/T317064) (owner: 10Giuseppe Lavagetto) [14:58:50] (03CR) 10Filippo Giunchedi: [C: 03+1] C:prometheus: fix permissions for prometheus directory [puppet] - 10https://gerrit.wikimedia.org/r/830640 (owner: 10Jbond) [14:59:49] 10SRE, 10Wikimedia-Mailing-lists: Archive coolest-tool-academy mailing list - https://phabricator.wikimedia.org/T317185 (10Aklapper) @Legoktm: That is a great question for @BMueller (or maybe @mseckington), as I am not aware of huge advantages either. [15:01:44] PROBLEM - Disk space on apt1001 is CRITICAL: DISK CRITICAL - free space: / 6635 MB (3% inode=98%): /tmp 6635 MB (3% inode=98%): /var/tmp 6635 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=apt1001&var-datasource=eqiad+prometheus/ops [15:02:47] !log ayounsi@cumin1001 START - Cookbook sre.network.prepare-upgrade [15:03:39] (03CR) 10Ssingh: [C: 03+1] "I see both versions in our code (at least in the py files in the Puppet repository). 
Since we specify python3 specifically, I think we sho" [puppet] - 10https://gerrit.wikimedia.org/r/818234 (owner: 10BCornwall) [15:05:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1118 (re)pooling @ 75%: Pooling after maintenance', diff saved to https://phabricator.wikimedia.org/P34120 and previous config saved to /var/cache/conftool/dbconfig/20220907-150523-root.json [15:05:29] (03CR) 10Filippo Giunchedi: Add the configuration to create LVM volumes for dse-k8s monitoring (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/830633 (https://phabricator.wikimedia.org/T310179) (owner: 10Btullis) [15:06:01] !log ayounsi@cumin1001 START - Cookbook sre.network.prepare-upgrade [15:06:33] (03CR) 10Ssingh: varnish: Remove extraneous checks for Docker (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/826367 (owner: 10BCornwall) [15:06:38] hey, someone here willing to merge https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/830641 for me at their convenience? :) [15:06:43] (it's a beta-only change) [15:06:44] (03CR) 10BCornwall: [C: 03+2] acme-chief: use /usr/bin/env as python interpreter [puppet] - 10https://gerrit.wikimedia.org/r/818234 (owner: 10BCornwall) [15:07:18] !log ayounsi@cumin1001 START - Cookbook sre.network.prepare-upgrade [15:07:19] (03CR) 10Ssingh: [V: 03+1] "To be merged when ATS9 is rolled out across all hosts." 
[puppet] - 10https://gerrit.wikimedia.org/r/803285 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh) [15:07:31] !log depooled wtp1026.eqiad.wmnet from parsoid cluster T307219 [15:07:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:34] T307219: Put parse parse10[01-24] in production - https://phabricator.wikimedia.org/T307219 [15:08:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1173 (re)pooling @ 50%: Pooling after maintenance', diff saved to https://phabricator.wikimedia.org/P34121 and previous config saved to /var/cache/conftool/dbconfig/20220907-150817-root.json [15:09:42] (03PS2) 10BCornwall: varnish: Remove extraneous checks for Docker [puppet] - 10https://gerrit.wikimedia.org/r/826367 [15:10:07] !log cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on wtp[1027,1047-1048].eqiad.wmnet with reason: Downtiming replaced wtp servers [15:10:22] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on wtp[1027,1047-1048].eqiad.wmnet with reason: Downtiming replaced wtp servers [15:10:41] !log cgoubert@puppetmaster1001 conftool action : set/pooled=inactive; selector: dc=eqiad,cluster=parsoid,name=wtp1047.eqiad.wmnet [15:10:53] !log cgoubert@puppetmaster1001 conftool action : set/pooled=inactive; selector: dc=eqiad,cluster=parsoid,name=wtp1048.eqiad.wmnet [15:11:02] !log cgoubert@puppetmaster1001 conftool action : set/pooled=inactive; selector: dc=eqiad,cluster=parsoid,name=wtp1027.eqiad.wmnet [15:11:41] (03CR) 10Dduvall: [C: 03+2] Run all puppetized deploy scripts as checks [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/828655 (https://phabricator.wikimedia.org/T313953) (owner: 10Dduvall) [15:14:49] (03PS1) 10Hashar: Json schema from Gerrit Java event classes [software/gerrit/plugins/events-wikimedia] - 10https://gerrit.wikimedia.org/r/830654 (https://phabricator.wikimedia.org/T304947) [15:20:28] !log 
marostegui@cumin1001 dbctl commit (dc=all): 'db1118 (re)pooling @ 100%: Pooling after maintenance', diff saved to https://phabricator.wikimedia.org/P34122 and previous config saved to /var/cache/conftool/dbconfig/20220907-152028-root.json [15:20:50] 10SRE, 10Infrastructure-Foundations, 10netops, 10Epic: [tracking] Don't keep on the public vlans hosts that don't require it - https://phabricator.wikimedia.org/T317177 (10ayounsi) It is not never-ending as it's about converting existing hosts. Either they will or won't. New services are out of scope. [15:21:04] (03PS5) 10Hashar: Implement REST API and Ssh commands [software/gerrit/plugins/events-wikimedia] - 10https://gerrit.wikimedia.org/r/814725 (https://phabricator.wikimedia.org/T304947) [15:22:41] (03CR) 10Jbond: [C: 03+2] "The following script was used to audit the fall out of this CR" [puppet] - 10https://gerrit.wikimedia.org/r/809095 (owner: 10Jbond) [15:23:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1173 (re)pooling @ 75%: Pooling after maintenance', diff saved to https://phabricator.wikimedia.org/P34124 and previous config saved to /var/cache/conftool/dbconfig/20220907-152322-root.json [15:26:16] PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [15:26:37] (03CR) 10Hashar: "I have extracted code from the original change at https://gerrit.wikimedia.org/r/c/operations/software/gerrit/plugins/events-wikimedia/+/8" [software/gerrit/plugins/events-wikimedia] - 10https://gerrit.wikimedia.org/r/830654 (https://phabricator.wikimedia.org/T304947) (owner: 10Hashar) [15:27:40] (03CR) 10Hashar: "I have extracted some code to a parent change ( https://gerrit.wikimedia.org/r/c/operations/software/gerrit/plugins/events-wikimedia/+/830" [software/gerrit/plugins/events-wikimedia] - 10https://gerrit.wikimedia.org/r/814725 
(https://phabricator.wikimedia.org/T304947) (owner: 10Hashar) [15:29:44] (03PS1) 10Muehlenhoff: Drop a few now obsolete permission [puppet] - 10https://gerrit.wikimedia.org/r/830656 [15:30:07] (03CR) 10Jbond: [C: 03+2] C:prometheus: fix permissions for prometheus directory [puppet] - 10https://gerrit.wikimedia.org/r/830640 (owner: 10Jbond) [15:30:19] (03CR) 10Jbond: [C: 03+2] P:ci::slave: add group to homedir [puppet] - 10https://gerrit.wikimedia.org/r/830642 (owner: 10Jbond) [15:30:54] (03CR) 10Jbond: [C: 03+2] P:analytics::refinery::job::gobblin_job: fix permissions [puppet] - 10https://gerrit.wikimedia.org/r/830647 (owner: 10Jbond) [15:33:21] (03CR) 10Ssingh: [C: 03+1] "Is this ready to be merged? Just making sure in case it blocks on me for something, or if we are good to go. Thanks, excited about this!" [puppet] - 10https://gerrit.wikimedia.org/r/822135 (https://phabricator.wikimedia.org/T313845) (owner: 10Jbond) [15:34:11] (03CR) 10Jbond: [C: 03+1] "lgrm" [puppet] - 10https://gerrit.wikimedia.org/r/830645 (https://phabricator.wikimedia.org/T315608) (owner: 10Muehlenhoff) [15:34:33] (03PS1) 10Ladsgroup: auto_schema: Basic script for rolling restart [software] - 10https://gerrit.wikimedia.org/r/830659 [15:36:34] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): hw troubleshooting: one disk not working properly in cloudcephosd1030.eqiad.wmnet - https://phabricator.wikimedia.org/T316673 (10fnegri) [15:36:41] 10SRE, 10Infrastructure-Foundations, 10cloud-services-team (Kanban): toolforge/paws k8s containers need to know about clouddumps100[12] - https://phabricator.wikimedia.org/T317144 (10Andrew) https://gerrit.wikimedia.org/r/plugins/gitiles/cloud/toolforge/volume-admission-controller/+/refs/heads/main/deploymen... 
[15:37:24] (03CR) 10Ottomata: "Don't totally understand but I guess +1 :)" [puppet] - 10https://gerrit.wikimedia.org/r/830647 (owner: 10Jbond) [15:38:00] thanks ottomata was more an FYI in case it made something explode :) [15:38:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1173 (re)pooling @ 100%: Pooling after maintenance', diff saved to https://phabricator.wikimedia.org/P34125 and previous config saved to /var/cache/conftool/dbconfig/20220907-153827-root.json [15:39:03] brett: are you happy for me to merge your change [15:39:06] (03CR) 10CI reject: [V: 04-1] auto_schema: Basic script for rolling restart [software] - 10https://gerrit.wikimedia.org/r/830659 (owner: 10Ladsgroup) [15:40:20] (03CR) 10Muehlenhoff: [C: 03+2] Fix entries in raid_mgmt_tools fact [puppet] - 10https://gerrit.wikimedia.org/r/830645 (https://phabricator.wikimedia.org/T315608) (owner: 10Muehlenhoff) [15:41:00] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/830656 (owner: 10Muehlenhoff) [15:43:17] (03CR) 10Vgutierrez: Add Trafficserver SLO dashboard (031 comment) [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/829214 (https://phabricator.wikimedia.org/T316921) (owner: 10Vgutierrez) [15:43:28] brett: i'm going to merge as seems "mostly harmless" [15:44:01] jbond: Actually, there's currently discussion on reverting it [15:44:29] ahh ok i was just about to make a comment on the CR questioning it :) [15:44:34] where is the discussion? [15:45:01] fyi it's merged for now but can revert [15:45:09] traffic-private for some reason. It's likely to be reverted [15:45:51] ack [15:45:54] (03CR) 10Jbond: acme-chief: use /usr/bin/env as python interpreter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/818234 (owner: 10BCornwall) [15:46:11] i added my comment to the CR, i'm guessing the discussion is probably around a similar thing [15:46:24] but if not ping me and i can expand [15:46:54] Yeah, it is. 
I'll include the relevant bits of the discussion [15:47:19] ack thanks [15:49:46] (03CR) 10Muehlenhoff: acme-chief: use /usr/bin/env as python interpreter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/818234 (owner: 10BCornwall) [15:49:48] (03PS1) 10Elukey: ml-serve: raise connection limit to the MW API [deployment-charts] - 10https://gerrit.wikimedia.org/r/830661 (https://phabricator.wikimedia.org/T313915) [15:49:54] RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [15:50:38] (03PS2) 10Btullis: Add the configuration to create LVM volumes for dse-k8s monitoring [puppet] - 10https://gerrit.wikimedia.org/r/830633 (https://phabricator.wikimedia.org/T310179) [15:53:42] PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:56:52] (03CR) 10BCornwall: [C: 03+2] acme-chief: use /usr/bin/env as python interpreter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/818234 (owner: 10BCornwall) [16:00:24] * Krinkle briefly staging/testing something on deploy1002/mwmaint1002 [16:03:47] (03CR) 10Samtar: [C: 04-1] "No longer sure this is the correct file back-end per T317195#8218031, wait for clarification" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830657 (https://phabricator.wikimedia.org/T317195) (owner: 10Samtar) [16:04:19] (03PS1) 10BCornwall: Revert "acme-chief: use /usr/bin/env as python interpreter" [puppet] - 10https://gerrit.wikimedia.org/r/830598 [16:05:54] (03CR) 10Muehlenhoff: acme-chief: use /usr/bin/env as python interpreter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/818234 (owner: 10BCornwall) [16:10:16] (03CR) 10Btullis: Add the configuration to create LVM volumes for dse-k8s monitoring (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/830633 (https://phabricator.wikimedia.org/T310179) (owner: 10Btullis) [16:11:51] * Krinkle done testing on deploy/mwmaint [16:16:15] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.prepare-upgrade (exit_code=0) [16:17:07] (03PS1) 10CDanis: Add turnilo config for new requestctl field [puppet] - 10https://gerrit.wikimedia.org/r/830666 (https://phabricator.wikimedia.org/T314578) [16:21:58] !log cparle@deploy1002 Started deploy [airflow-dags/platform_eng@9e4ed94]: (no justification provided) [16:22:16] !log cparle@deploy1002 Finished deploy [airflow-dags/platform_eng@9e4ed94]: (no justification provided) (duration: 00m 17s) [16:22:34] !log installing twisted security updates on bullseye [16:22:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:46] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.4 point update - https://phabricator.wikimedia.org/T312637 (10MoritzMuehlenhoff) [16:28:11] (03CR) 10Dduvall: [V: 03+2 C: 03+2] Run all puppetized deploy scripts as checks [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/828655 (https://phabricator.wikimedia.org/T313953) (owner: 10Dduvall) [16:28:44] (03PS3) 10JMeybohm: Alert on high lateny of kubelet operations [alerts] - 10https://gerrit.wikimedia.org/r/830228 (https://phabricator.wikimedia.org/T311251) [16:28:46] (03PS4) 10JMeybohm: Alert on high Kubernetes API error rate [alerts] - 10https://gerrit.wikimedia.org/r/830624 (https://phabricator.wikimedia.org/T311251) [16:28:48] (03PS2) 10JMeybohm: Alert on high Kubernetes API latency [alerts] - 10https://gerrit.wikimedia.org/r/830637 (https://phabricator.wikimedia.org/T311251) [16:29:36] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/829108 (https://phabricator.wikimedia.org/T316903) (owner: 10Slyngshede) [16:30:51] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.prepare-upgrade (exit_code=0) [16:31:15] !log 
ayounsi@cumin1001 END (PASS) - Cookbook sre.network.prepare-upgrade (exit_code=0) [16:31:19] (03CR) 10Volans: "FYI this could be simplified *a lot* if integrated into spicerack and doing it as a cookbook." [software] - 10https://gerrit.wikimedia.org/r/830659 (owner: 10Ladsgroup) [16:31:56] !log xcollazo@deploy1002 Started deploy [airflow-dags/platform_eng@9e4ed94]: Update platform_eng Airflow to latest [16:32:06] !log xcollazo@deploy1002 Finished deploy [airflow-dags/platform_eng@9e4ed94]: Update platform_eng Airflow to latest (duration: 00m 10s) [16:33:02] PROBLEM - SSH on mw1327.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:37:40] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [16:42:49] (03CR) 10Volans: acme-chief: use /usr/bin/env as python interpreter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/818234 (owner: 10BCornwall) [16:45:46] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/830598 (owner: 10BCornwall) [16:46:03] (03CR) 10BCornwall: [C: 03+2] Revert "acme-chief: use /usr/bin/env as python interpreter" [puppet] - 10https://gerrit.wikimedia.org/r/830598 (owner: 10BCornwall) [16:46:09] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/830656 (owner: 10Muehlenhoff) [16:55:06] RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:56:25] (03PS1) 10Ladsgroup: auto_schema: Allow runnint it on one dc only with --dc [software] - 10https://gerrit.wikimedia.org/r/830671 [16:56:54] (03CR) 10CI reject: [V: 04-1] auto_schema: Allow runnint it on one dc only with --dc [software] - 
10https://gerrit.wikimedia.org/r/830671 (owner: 10Ladsgroup) [17:00:00] (03CR) 10Ladsgroup: auto_schema: Basic script for rolling restart (031 comment) [software] - 10https://gerrit.wikimedia.org/r/830659 (owner: 10Ladsgroup) [17:01:50] (03CR) 10Joal: [C: 03+1] "LGTM - Thank you @CDanis" [puppet] - 10https://gerrit.wikimedia.org/r/830666 (https://phabricator.wikimedia.org/T314578) (owner: 10CDanis) [17:02:01] btullis: if you have a minute please --^ [17:06:51] 10SRE, 10ops-codfw, 10Observability-Logging: Degraded RAID on logstash2027 - https://phabricator.wikimedia.org/T316996 (10wiki_willy) a:03Papaul [17:14:34] (03PS2) 10Ladsgroup: auto_schema: Allow runnint it on one dc only with --dc [software] - 10https://gerrit.wikimedia.org/r/830671 [17:15:08] (03CR) 10BCornwall: [C: 03+2] acme-chief: use /usr/bin/env as python interpreter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/818234 (owner: 10BCornwall) [17:23:38] (03PS1) 10Ladsgroup: auto_schema: Add support for a graceful stop [software] - 10https://gerrit.wikimedia.org/r/830672 [17:23:42] (03PS3) 10Jdlrobson: Wikidata has a wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830312 (https://phabricator.wikimedia.org/T315572) [17:24:26] !log herron@cumin1001 START - Cookbook sre.kafka.roll-restart-brokers for Kafka A:kafka-logging-codfw cluster: Roll restart of jvm daemons. [17:25:24] PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [17:29:38] RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [17:34:20] RECOVERY - SSH on mw1327.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:34:44] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:35:06] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:48:14] (03PS1) 10Volans: sre.hosts.provision: reboot after RAID changes [cookbooks] - 10https://gerrit.wikimedia.org/r/830676 [17:49:10] (03CR) 10Volans: sre.hosts.provision: reboot after RAID changes (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/830676 (owner: 10Volans) [18:00:05] jeena and dduvall: Dear deployers, time to do the Train log triage with CPT deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220907T1800). [18:00:05] jeena and dduvall: #bothumor My software never has bugs. It just develops random features. Rise for MediaWiki train - Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220907T1800). 
[18:02:57] Train is blocked for now due to some ongoing performance issues [18:17:46] PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [18:20:30] 10SRE, 10Traffic, 10Patch-For-Review: ATS should alert if the number of total or active connections reached maximum - https://phabricator.wikimedia.org/T292815 (10BCornwall) 05Stalled→03In progress [18:25:06] !log herron@cumin1001 END (PASS) - Cookbook sre.kafka.roll-restart-brokers (exit_code=0) for Kafka A:kafka-logging-codfw cluster: Roll restart of jvm daemons. [18:26:15] !log herron@cumin1001 START - Cookbook sre.kafka.roll-restart-brokers for Kafka A:kafka-logging-eqiad cluster: Roll restart of jvm daemons. [18:29:40] RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [18:33:26] !log dduvall@deploy1002 Started deploy [phabricator/deployment@a7616e6]: testing deployment to phab2001 (inactive) [18:34:02] !log dduvall@deploy1002 Finished deploy [phabricator/deployment@a7616e6]: testing deployment to phab2001 (inactive) (duration: 00m 35s) [18:36:49] (03CR) 10CDanis: [C: 03+2] Add turnilo config for new requestctl field [puppet] - 10https://gerrit.wikimedia.org/r/830666 (https://phabricator.wikimedia.org/T314578) (owner: 10CDanis) [18:52:05] (03PS1) 10CDanis: Fix minor requestctl reporting bug [puppet] - 10https://gerrit.wikimedia.org/r/830677 (https://phabricator.wikimedia.org/T305582) [18:52:09] (03PS9) 10AOkoth: vrts: install vrts script [puppet] - 10https://gerrit.wikimedia.org/r/828673 (https://phabricator.wikimedia.org/T317059) [18:52:31] (03PS10) 10AOkoth: vrts: install vrts script [puppet] - 10https://gerrit.wikimedia.org/r/828673 
(https://phabricator.wikimedia.org/T317059) [18:55:20] (03PS1) 10CDanis: Remove buggy comma filter footer [software/conftool] - 10https://gerrit.wikimedia.org/r/830678 (https://phabricator.wikimedia.org/T305582) [18:56:03] (03PS11) 10AOkoth: vrts: install vrts script [puppet] - 10https://gerrit.wikimedia.org/r/828673 (https://phabricator.wikimedia.org/T317059) [18:56:39] (03PS12) 10AOkoth: vrts: install vrts script [puppet] - 10https://gerrit.wikimedia.org/r/828673 (https://phabricator.wikimedia.org/T317059) [18:57:39] (03PS2) 10CDanis: Remove buggy comma filter footer [software/conftool] - 10https://gerrit.wikimedia.org/r/830678 (https://phabricator.wikimedia.org/T305582) [18:58:08] (03CR) 10AOkoth: vrts: install vrts script (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/828673 (https://phabricator.wikimedia.org/T317059) (owner: 10AOkoth) [18:58:16] (03CR) 10AOkoth: "https://puppet-compiler.wmflabs.org/pcc-worker1001/37155/" [puppet] - 10https://gerrit.wikimedia.org/r/828673 (https://phabricator.wikimedia.org/T317059) (owner: 10AOkoth) [19:02:30] (03CR) 10AOkoth: [C: 03+1] C:spamassassin Allow debugging of why service fails. 
[puppet] - 10https://gerrit.wikimedia.org/r/829108 (https://phabricator.wikimedia.org/T316903) (owner: 10Slyngshede) [19:04:59] (03CR) 10Dzahn: "lgtm," [puppet] - 10https://gerrit.wikimedia.org/r/828673 (https://phabricator.wikimedia.org/T317059) (owner: 10AOkoth) [19:05:03] (03CR) 10RLazarus: [C: 03+1] Fix minor requestctl reporting bug [puppet] - 10https://gerrit.wikimedia.org/r/830677 (https://phabricator.wikimedia.org/T305582) (owner: 10CDanis) [19:06:06] (03CR) 10RLazarus: [C: 03+1] Remove buggy comma filter footer [software/conftool] - 10https://gerrit.wikimedia.org/r/830678 (https://phabricator.wikimedia.org/T305582) (owner: 10CDanis) [19:11:03] 10SRE, 10conftool, 10Patch-For-Review: Annotate X-Analytics header with any matching actions - https://phabricator.wikimedia.org/T305582 (10CDanis) [19:12:46] (03PS2) 10Jdlrobson: Enable Extension:Nearby on wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830313 (https://phabricator.wikimedia.org/T246493) [19:12:51] (03PS3) 10Jdlrobson: Enable Extension:Nearby on wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830313 (https://phabricator.wikimedia.org/T246493) [19:14:33] (03PS13) 10AOkoth: vrts: install vrts script [puppet] - 10https://gerrit.wikimedia.org/r/828673 (https://phabricator.wikimedia.org/T317059) [19:17:19] (03CR) 10Dzahn: [C: 03+1] "lgtm, merge and confirm the environment variables are set and exported. 
you can always edit if needed" [puppet] - 10https://gerrit.wikimedia.org/r/828673 (https://phabricator.wikimedia.org/T317059) (owner: 10AOkoth) [19:17:21] (03CR) 10CDanis: [C: 03+2] Fix minor requestctl reporting bug [puppet] - 10https://gerrit.wikimedia.org/r/830677 (https://phabricator.wikimedia.org/T305582) (owner: 10CDanis) [19:24:59] (03CR) 10CDanis: [C: 03+2] Remove buggy comma filter footer [software/conftool] - 10https://gerrit.wikimedia.org/r/830678 (https://phabricator.wikimedia.org/T305582) (owner: 10CDanis) [19:26:51] !log herron@cumin1001 END (PASS) - Cookbook sre.kafka.roll-restart-brokers (exit_code=0) for Kafka A:kafka-logging-eqiad cluster: Roll restart of jvm daemons. [19:27:19] (03Merged) 10jenkins-bot: Remove buggy comma filter footer [software/conftool] - 10https://gerrit.wikimedia.org/r/830678 (https://phabricator.wikimedia.org/T305582) (owner: 10CDanis) [19:29:13] 10SRE, 10Observability-Logging, 10Patch-For-Review, 10SRE Observability (FY2022/2023-Q1): Move Kafka logging to the new intermediate PKI - https://phabricator.wikimedia.org/T300130 (10colewhite) Clients: - rsyslog: `/etc/ssl/certs/wmf-ca-certificates.crt` - logstash collectors: `/etc/ssl/localcerts/wm... [19:31:34] (03PS1) 10Dduvall: phabricator: Allow deploy user to preserve environment when sudoing [puppet] - 10https://gerrit.wikimedia.org/r/830682 (https://phabricator.wikimedia.org/T313259) [19:31:55] (03PS2) 10Dduvall: phabricator: Allow deploy user to preserve environment when sudoing [puppet] - 10https://gerrit.wikimedia.org/r/830682 (https://phabricator.wikimedia.org/T313259) [19:32:00] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10cloud-services-team (Kanban): toolforge/paws k8s containers need to know about clouddumps100[12] - https://phabricator.wikimedia.org/T317144 (10Andrew) >>! In T317144#8215878, @rook wrote: > PAWS containers should start mounting `/mnt/nfs/dumps-clo... 
[19:34:09] (03CR) 10Dduvall: "I wasn't sure if `SETENV:` was the best approach (seems like a `Defaults!/usr/local/sbin/phab_deploy_* env_keep+=...` line might work as w" [puppet] - 10https://gerrit.wikimedia.org/r/830682 (https://phabricator.wikimedia.org/T313259) (owner: 10Dduvall) [19:36:02] 10SRE, 10observability: en.wikibooks.org has changed legal footer - https://phabricator.wikimedia.org/T317169 (10Dzahn) @fgiunchedi Maybe it's also ok if we just rename this ticket to "move those alerts to alertmanager" to recycle it? Would that be desired? [19:39:41] (03PS1) 10Cwhite: apifeatureusage: use new kafka truststore [puppet] - 10https://gerrit.wikimedia.org/r/830684 (https://phabricator.wikimedia.org/T300130) [19:42:24] (03CR) 10Cwhite: "PCC: https://puppet-compiler.wmflabs.org/pcc-worker1001/37157/" [puppet] - 10https://gerrit.wikimedia.org/r/830684 (https://phabricator.wikimedia.org/T300130) (owner: 10Cwhite) [19:45:09] (03CR) 10Cwhite: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/821324 (https://phabricator.wikimedia.org/T314139) (owner: 10Cwhite) [19:46:05] (03PS1) 10BCornwall: ats: Enable node_ats_config monitoring [puppet] - 10https://gerrit.wikimedia.org/r/830686 (https://phabricator.wikimedia.org/T292815) [19:46:29] (03PS1) 10Bartosz Dziewoński: Enable wgDiscussionToolsEnablePermalinksBackend on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830687 (https://phabricator.wikimedia.org/T315353) [19:46:45] (03PS1) 10BryanDavis: striker: bump container version to 2022-09-07-191738-production [puppet] - 10https://gerrit.wikimedia.org/r/830688 (https://phabricator.wikimedia.org/T296893) [19:46:55] (03PS3) 10Cwhite: logstash: reduce webrequest retention to 31 days [puppet] - 10https://gerrit.wikimedia.org/r/821324 (https://phabricator.wikimedia.org/T313099) [19:47:16] PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [19:49:16] (03PS4) 10Cwhite: logstash: reduce webrequest retention to 31 days [puppet] - 10https://gerrit.wikimedia.org/r/821324 (https://phabricator.wikimedia.org/T313099) [19:49:40] RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [19:50:46] (03PS1) 10Cwhite: logstash: reduce replica count to 1 after 1 day [puppet] - 10https://gerrit.wikimedia.org/r/830690 (https://phabricator.wikimedia.org/T313099) [19:51:12] (03CR) 10BryanDavis: [V: 03+1] "PCC output: https://puppet-compiler.wmflabs.org/pcc-worker1002/37159/" [puppet] - 10https://gerrit.wikimedia.org/r/830688 (https://phabricator.wikimedia.org/T296893) (owner: 10BryanDavis) [19:52:41] (03CR) 10Andrew Bogott: [C: 03+2] striker: bump container version to 2022-09-07-191738-production [puppet] - 10https://gerrit.wikimedia.org/r/830688 (https://phabricator.wikimedia.org/T296893) (owner: 10BryanDavis) [20:00:05] RoanKattouw, Urbanecm, cjming, and TheresNoTime: How many deployers does it take to do UTC late backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220907T2000). [20:00:05] zabe_, Jdlrobson, and MatmaRex: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:21] hai [20:01:33] Hey [20:02:00] zabe: around? first in queue :) [20:02:47] will move on to Jdlrobson [20:03:30] hi [20:03:40] Jdlrobson: around to test? 
:) [20:03:43] hi MatmaRex [20:04:52] hm, guess you're up then MatmaRex :D [20:05:00] (Jdlrobson was active on slack 5 minutes ago, he'll probably be around soon) [20:05:12] thanks TheresNoTime [20:05:26] to confirm, you want the script run first? [20:05:31] yes [20:05:40] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10cloud-services-team (Kanban): toolforge/paws k8s containers need to know about clouddumps100[12] - https://phabricator.wikimedia.org/T317144 (10rook) So far as I have found PAWS mounts both labstores to /mnt/nfs but links to various parts of just l... [20:05:44] ack, doing [20:06:33] TheresNoTime: relevant to the config change, do you know if there's a nice dashboard somewhere where i could watch mediawiki database replag or other health metrics? just in case. the config change will enable some new code that writes some new stuff to databases, it should all be perfectly reasonable, but just in case [20:06:53] present [20:07:05] sorry just answered a thread elsewhere :) [20:08:04] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10cloud-services-team (Kanban): toolforge/paws k8s containers need to know about clouddumps100[12] - https://phabricator.wikimedia.org/T317144 (10rook) https://github.com/toolforge/paws/pull/199 [20:08:15] !log puppet compiler out of disk space, (pcc-worker1003): identified build 37153 as huge compared to others in the filesystem, then clicked to delete it via integration.wm.org web UI [20:08:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:43] !log running `extensions/WikimediaMaintenance/createExtensionTables.php discussiontools` on mwmaint1002 [20:08:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:09:07] MatmaRex: hmm, one sec.. I've seen a replag dashboard before.. [20:09:36] (guessing you want more info than https://replag.toolforge.org/ ?) 
[20:09:44] I wouldn't worry about replag on creating a few tables [20:09:54] (well, not too much) [20:10:12] Reedy: it's not about the tables, it's about what goes in them :D [20:10:29] perhaps https://grafana.wikimedia.org/d/000000303/mysql-replication-lag?orgId=1 ? [20:10:42] (03CR) 10Herron: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/821324 (https://phabricator.wikimedia.org/T313099) (owner: 10Cwhite) [20:10:46] thanks [20:10:51] !log integration.wikimedia.org - clicked to delete builds 36713 and 37153 because they were several GB in size [20:10:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:11:15] i'm not really expecting issues, i'm fine with just something where it's bad if numbers go up [20:11:19] (currently in the `r`s, so almost done..) [20:11:19] (03CR) 10Herron: [C: 03+1] logstash: reduce replica count to 1 after 1 day [puppet] - 10https://gerrit.wikimedia.org/r/830690 (https://phabricator.wikimedia.org/T313099) (owner: 10Cwhite) [20:11:48] and if i see numbers go up i can ping people to revert changes [20:12:07] !log pcc-worker1003 - rm of /srv/jenkins/puppet-compiler/output/36713 and 37153 - /srv is back to 58% usage again [20:12:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:12:37] MatmaRex: script done, now going to do 830687 [20:13:10] (03CR) 10TrainBranchBot: [C: 03+2] "Approved via scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830687 (https://phabricator.wikimedia.org/T315353) (owner: 10Bartosz Dziewoński) [20:13:57] (03Merged) 10jenkins-bot: Enable wgDiscussionToolsEnablePermalinksBackend on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830687 (https://phabricator.wikimedia.org/T315353) (owner: 10Bartosz Dziewoński) [20:14:04] (also ack Jdlrobson, will do yours next :)) [20:14:23] !log samtar@deploy1002 Started scap: Backport for [[gerrit:830687|Enable wgDiscussionToolsEnablePermalinksBackend on all wikis (T315353)]] 
[20:14:26] T315353: Create database tables for permalinks in production wikis, and enable the feature - https://phabricator.wikimedia.org/T315353 [20:14:48] !log samtar@deploy1002 samtar and matmarex: Backport for [[gerrit:830687|Enable wgDiscussionToolsEnablePermalinksBackend on all wikis (T315353)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet [20:15:13] MatmaRex: live on mwdebug1001 [20:15:22] thanks, looking [20:16:28] looks good. this link works: https://test2.wikipedia.org/wiki/Special:GoToComment/c-Matma_Rex-20220907201500-Matma_Rex-2021-07-01T18:11:00.000Z [20:16:50] Great, will sync [20:18:23] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:19:21] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:19:22] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:20:21] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:21:20] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:830687|Enable wgDiscussionToolsEnablePermalinksBackend on all wikis (T315353)]] (duration: 06m 57s) [20:21:23] T315353: Create database tables for permalinks in production wikis, and enable the feature - https://phabricator.wikimedia.org/T315353 [20:22:13] bbiab just grabbing some water [20:22:36] PROBLEM - MediaWiki exceptions and fatals per minute for jobrunner on alert1001 is CRITICAL: 1265 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:22:55] MatmaRex: I am seeing an increase in DB errors [20:23:23] hmm [20:23:49] T317236, T317237 [20:23:49] T317236: Wikimedia\Rdbms\DBQueryError: Error 1062: Duplicate entry 'h-Sujalajus-2017-04-28T09:31:00.000Z' for key 'it_itemname'Function: 
MediaWiki\Extension\DiscussionTools\ThreadItemStore::insertThreadItemsQuery: INSERT INTO `discuss - https://phabricator.wikimedia.org/T317236 [20:23:50] T317237: Wikimedia\Rdbms\DBUnexpectedError: Cannot execute Wikimedia\Rdbms\Database::rollback critical section while session state is out of sync.A critical section from Wikimedia\Rdbms\Database::cancelAtomic has failed#0 /srv/mediaw - https://phabricator.wikimedia.org/T317237 [20:24:16] 1207 events in a spike, now stopped [20:24:24] TheresNoTime: Can you use ~dancy/devel-scap (instead of plain scap) for your next backport? [20:24:43] dancy: will do [20:24:49] Thx! [20:25:10] yeah, i see stuff in logstash too [20:25:50] https://logstash.wikimedia.org/app/dashboards#/view/mediawiki-errors?_g=h@2294574&_a=h@a12d892 is where I'm looking, starting to get towards the point where I'd want to perhaps rollback [20:25:58] but it seems to have stopped happening at the moment? [20:26:15] yeah, if it continues, let's revert [20:26:43] i'm just wondering why it would happen, and if it could be related to the deployment itself? like servers running different code or something [20:27:44] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:27:56] afaik there is a point where that could be the case [20:28:24] for the record, i think none of this is actually affecting users, the errors are happening in try…catch and logged explicitly [20:28:38] but clearly the feature isn't working as it should [20:29:14] MatmaRex: I am going to revert [20:29:23] yeah.
thanks [20:29:48] (PS1) Samtar: Revert "Enable wgDiscussionToolsEnablePermalinksBackend on all wikis" [mediawiki-config] - https://gerrit.wikimedia.org/r/830600 [20:30:13] (CR) Samtar: [C: +2] Revert "Enable wgDiscussionToolsEnablePermalinksBackend on all wikis" [mediawiki-config] - https://gerrit.wikimedia.org/r/830600 (owner: Samtar) [20:31:06] (Merged) jenkins-bot: Revert "Enable wgDiscussionToolsEnablePermalinksBackend on all wikis" [mediawiki-config] - https://gerrit.wikimedia.org/r/830600 (owner: Samtar) [20:32:19] ah crap, dancy I used plain scap for that one, I'll use the devel- one for the next deploy, sorry [20:32:27] (syncing revert now) [20:32:34] no problem. [20:34:56] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:34:58] exceptions have stopped entirely now [20:35:27] !log samtar@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:830600|Revert "Enable wgDiscussionToolsEnablePermalinksBackend on all wikis"]] (duration: 03m 42s) [20:35:31] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:36:09] Jdlrobson: okay, yours next :) starting with 830312 [20:36:31] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:36:32] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:36:48] sweet [20:37:04] RECOVERY - MediaWiki exceptions and fatals per minute for jobrunner on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:37:12] (CR) Samtar: [C: +2] "deploying" [mediawiki-config] - https://gerrit.wikimedia.org/r/830312 (https://phabricator.wikimedia.org/T315572) (owner: Jdlrobson) [20:37:23] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:37:26] thanks TheresNoTime, sorry about that [20:37:40] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [20:37:49] No worries! Not a clue what happened there, I logged a few of the exceptions, I hope it helps :) [20:38:01] (Merged) jenkins-bot: Wikidata has a wordmark [mediawiki-config] - https://gerrit.wikimedia.org/r/830312 (https://phabricator.wikimedia.org/T315572) (owner: Jdlrobson) [20:38:03] TheresNoTime, hey, I am now here [20:38:16] hey zabe_, just doing Jdlrobson's patches :) [20:38:23] sure [20:39:13] Jdlrobson: 830312 is on mwdebug1001 [20:39:16] TheresNoTime: FYI we might have one late addition relating to this UBN (https://gerrit.wikimedia.org/r/c/mediawiki/extensions/FlaggedRevs/+/830696) I'll keep you posted..
I'm trying to get confirmation we are happy to merge that now [20:39:19] Looking at 830312 now [20:39:29] SRE-OnFire, Observability-Alerting: Improve AlertManager alert titles as sent to VictorOps - https://phabricator.wikimedia.org/T317240 (CDanis) [20:39:29] Jdlrobson: ack [20:40:10] TheresNoTime: LGTM feel free to sync that one [20:40:27] TheresNoTime: and yes looks like we want to backport this FlaggedRevs patch [20:40:35] adding to calendar now [20:40:58] Okay, syncing (and using your devel- version dancy) [20:41:02] (PS1) Jdlrobson: Respect skin's TOC option [extensions/FlaggedRevs] (wmf/1.39.0-wmf.27) - https://gerrit.wikimedia.org/r/830601 (https://phabricator.wikimedia.org/T316947) [20:41:20] (PS1) Jdlrobson: Respect skin's TOC option [extensions/FlaggedRevs] (wmf/1.39.0-wmf.28) - https://gerrit.wikimedia.org/r/830602 (https://phabricator.wikimedia.org/T316947) [20:42:26] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:42:54] just FYI Jdlrobson, https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/830313 seems merge conflicted [20:43:04] (PS4) Jdlrobson: Enable Extension:Nearby on wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/830313 (https://phabricator.wikimedia.org/T246493) [20:43:07] just needs a rebase [20:43:27] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:43:28] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:43:35] [sidenote] I'm not sure why but mediawiki-config patches always seem to report merge conflicts when they don't have them? [20:43:55] "because gerrit" [20:43:58] Gerrit is picky about what it considers to be a conflict.
[20:44:03] Edit to the same file == conflict [20:44:14] TIL [20:44:21] !log samtar@deploy1002 Synchronized static/images/mobile/copyright/wikidata-en.svg: Config: [[gerrit:830312|Wikidata has a wordmark (T315572)]] (duration: 03m 45s) [20:44:24] T315572: Add wordmark for Wikidata (Vector 2022) - https://phabricator.wikimedia.org/T315572 [20:44:29] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:44:36] Jdlrobson: will do your other patch, 830313 [20:44:43] I think that is very intended for mediawiki-config (because letting git merge without anyone looking can be dangerous here) [20:44:57] oh wait no [20:45:00] (PS1) BryanDavis: striker: bump container version to 2022-09-07-203936-production [puppet] - https://gerrit.wikimedia.org/r/830699 (https://phabricator.wikimedia.org/T315706) [20:45:39] (hadn't done InitialiseSettings, still syncing the wordmark patch, sigh) [20:45:54] Ack [20:46:41] * TheresNoTime will run the window until everything is done [20:46:48] don't think we'll be much over anyway [20:48:04] > Gerrit is picky about what it considers to be a conflict. -- and uses a pure java implementation of the git protocol that is not as good at resolving nearby changes as git. [20:49:03] (specifically https://www.eclipse.org/jgit/) [20:49:06] !log samtar@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:830312|Wikidata has a wordmark (T315572)]] (duration: 03m 44s) [20:49:19] (CR) BryanDavis: [V: +1] "PCC output: https://puppet-compiler.wmflabs.org/pcc-worker1003/37161/" [puppet] - https://gerrit.wikimedia.org/r/830699 (https://phabricator.wikimedia.org/T315706) (owner: BryanDavis) [20:49:35] (CR) TrainBranchBot: [C: +2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/830313 (https://phabricator.wikimedia.org/T246493) (owner: Jdlrobson) [20:49:49] dancy: is that what you wanted to test ^ [20:49:58] It is.
Thank you! [20:50:11] (CR) Andrew Bogott: [C: +2] striker: bump container version to 2022-09-07-203936-production [puppet] - https://gerrit.wikimedia.org/r/830699 (https://phabricator.wikimedia.org/T315706) (owner: BryanDavis) [20:50:21] (Merged) jenkins-bot: Enable Extension:Nearby on wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/830313 (https://phabricator.wikimedia.org/T246493) (owner: Jdlrobson) [20:50:47] !log samtar@deploy1002 Started scap: Backport for [[gerrit:830313|Enable Extension:Nearby on wikidata (T246493)]] [20:50:52] T246493: [EPIC] Deploy NearbyPages everywhere - https://phabricator.wikimedia.org/T246493 [20:51:11] !log samtar@deploy1002 samtar and jdlrobson: Backport for [[gerrit:830313|Enable Extension:Nearby on wikidata (T246493)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet [20:51:28] (I do love that script) [20:51:40] Jdlrobson: live on mwdebug1001 [20:51:59] testing [20:52:29] LGTM [20:52:37] syncing [20:53:59] zabe_: I'm guessing 830641 doesn't need much/any testing?
[20:54:21] yep [20:54:34] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:54:44] okay, I'll do that next, then move on to Jdlrobson's backports [20:54:49] ack [20:55:35] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:55:36] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:56:03] (PS2) Samtar: beta: Remove deployment-parsoid11 from wgLinterSubmitterWhitelist [mediawiki-config] - https://gerrit.wikimedia.org/r/830641 (owner: Zabe) [20:56:28] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:56:42] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:830313|Enable Extension:Nearby on wikidata (T246493)]] (duration: 05m 54s) [20:56:46] T246493: [EPIC] Deploy NearbyPages everywhere - https://phabricator.wikimedia.org/T246493 [20:57:10] (CR) TrainBranchBot: [C: +2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/830641 (owner: Zabe) [20:57:54] (Merged) jenkins-bot: beta: Remove deployment-parsoid11 from wgLinterSubmitterWhitelist [mediawiki-config] - https://gerrit.wikimedia.org/r/830641 (owner: Zabe) [20:58:18] !log samtar@deploy1002 Started scap: Backport for [[gerrit:830641|beta: Remove deployment-parsoid11 from wgLinterSubmitterWhitelist]] [20:58:41] !log samtar@deploy1002 samtar and zabe: Backport for [[gerrit:830641|beta: Remove deployment-parsoid11 from wgLinterSubmitterWhitelist]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet [20:58:49] syncing [20:59:40] scap backport could probably skip all syncing steps for labs-only patches [21:00:14] yeah worth logging that as a feature if it's not already [21:00:33] even if it's just hardcoded to the `x-labs.php` files [21:01:06] !log extending UTC late backport window [21:01:08]
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:01:24] (PS1) Arlolra: Fix selser on html endpoints [core] (wmf/1.39.0-wmf.27) - https://gerrit.wikimedia.org/r/830702 (https://phabricator.wikimedia.org/T317215) [21:01:31] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:02:30] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:02:31] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:02:45] (PS1) Arlolra: Fix selser on html endpoints [core] (wmf/1.39.0-wmf.28) - https://gerrit.wikimedia.org/r/830703 (https://phabricator.wikimedia.org/T317215) [21:02:55] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:830641|beta: Remove deployment-parsoid11 from wgLinterSubmitterWhitelist]] (duration: 04m 36s) [21:03:06] okay Jdlrobson, doing those backports now starting with 830601 [21:03:19] (CR) Subramanya Sastry: [C: +1] Fix selser on html endpoints [core] (wmf/1.39.0-wmf.27) - https://gerrit.wikimedia.org/r/830702 (https://phabricator.wikimedia.org/T317215) (owner: Arlolra) [21:03:21] (CR) TrainBranchBot: [C: +2] "Approved by samtar@deploy1002 using scap backport" [extensions/FlaggedRevs] (wmf/1.39.0-wmf.27) - https://gerrit.wikimedia.org/r/830601 (https://phabricator.wikimedia.org/T316947) (owner: Jdlrobson) [21:03:24] created T317242 [21:03:24] T317242: Make "scap backport" skip syncing steps for labs-only changes - https://phabricator.wikimedia.org/T317242 [21:03:35] (CR) Subramanya Sastry: [C: +1] Fix selser on html endpoints [core] (wmf/1.39.0-wmf.28) - https://gerrit.wikimedia.org/r/830703 (https://phabricator.wikimedia.org/T317215) (owner: Arlolra) [21:04:21] TheresNoTime: is there time to backport my patches above?
[21:04:53] arlolra: we're over time I'm afraid, sorry [21:06:06] ( and these two UBN ones take like 15 minutes in Jenkins :(( ) [21:06:13] :) [21:06:18] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:06:23] but also :( [21:06:33] :/ [21:06:53] (Merged) jenkins-bot: Respect skin's TOC option [extensions/FlaggedRevs] (wmf/1.39.0-wmf.27) - https://gerrit.wikimedia.org/r/830601 (https://phabricator.wikimedia.org/T316947) (owner: Jdlrobson) [21:07:21] !log samtar@deploy1002 Started scap: Backport for [[gerrit:830601|Respect skin's TOC option (T316947)]] [21:07:24] T316947: FlaggedRevisions incorrectly displays table of contents on Vector 2022 for flagged pages - https://phabricator.wikimedia.org/T316947 [21:07:45] !log samtar@deploy1002 samtar and jdlrobson: Backport for [[gerrit:830601|Respect skin's TOC option (T316947)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet [21:08:04] Jdlrobson: can this be tested on mwdebug1001 ? [21:08:22] (I note it already has been according to the commit message but...) [21:09:09] checking [21:10:17] TheresNoTime: LGTM please sync [21:10:23] syncing [21:11:16] arlolra's patch is technically an UBN since it stops dirty diffs on VE edits on officewiki, wikitech and other non-RESTBase wikis but, given that it has been broken now for about 10 days, we could wait till tomorrow's backport window.
[21:11:20] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:12:18] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:12:19] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:12:39] subbu: I'd certainly appreciate it if it could be scheduled for tomorrow morning instead, I'm going to be ~30 minutes over the window [21:13:13] plus playing with Gerrit for more than an hour is not advised [21:13:15] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:13:32] tomorrow will be fine [21:13:33] thanks [21:13:35] sounds good. we can do it tomorrow. thanks. [21:13:44] Thank you for the understanding :) [21:14:27] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:830601|Respect skin's TOC option (T316947)]] (duration: 07m 06s) [21:14:30] T316947: FlaggedRevisions incorrectly displays table of contents on Vector 2022 for flagged pages - https://phabricator.wikimedia.org/T316947 [21:14:43] Jdlrobson: okay, now 830602, last one :) [21:14:55] (CR) TrainBranchBot: [C: +2] "Approved by samtar@deploy1002 using scap backport" [extensions/FlaggedRevs] (wmf/1.39.0-wmf.28) - https://gerrit.wikimedia.org/r/830602 (https://phabricator.wikimedia.org/T316947) (owner: Jdlrobson) [21:17:51] TheresNoTime: <3 [21:18:25] oh, TheresNoTime, no more CR+2, scap pull, scap deploy, etc.?
[21:18:29] (Merged) jenkins-bot: Respect skin's TOC option [extensions/FlaggedRevs] (wmf/1.39.0-wmf.28) - https://gerrit.wikimedia.org/r/830602 (https://phabricator.wikimedia.org/T316947) (owner: Jdlrobson) [21:18:55] !log samtar@deploy1002 Started scap: Backport for [[gerrit:830602|Respect skin's TOC option (T316947)]] [21:19:09] hauskater: it's *awesome* [21:19:18] !log samtar@deploy1002 samtar and jdlrobson: Backport for [[gerrit:830602|Respect skin's TOC option (T316947)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet [21:19:21] Me? I know :P [21:19:39] Jdlrobson: live on mwdebug1001 [21:20:28] hauskater: shush you [21:22:50] going to go ahead and sync [21:23:16] (testing) [21:23:28] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:24:19] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:24:20] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:24:28] TheresNoTime: good to sync [21:24:38] :D [21:25:07] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:26:57] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:830602|Respect skin's TOC option (T316947)]] (duration: 08m 02s) [21:27:00] T316947: FlaggedRevisions incorrectly displays table of contents on Vector 2022 for flagged pages - https://phabricator.wikimedia.org/T316947 [21:27:09] Right all done :) [21:27:22] !log closing UTC late backport window, +27m [21:27:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:33:24] (PS2) Andrew Bogott: Rename 'Galera haproxy failover' alert [puppet] - https://gerrit.wikimedia.org/r/830646 [21:33:26] (PS1) Andrew Bogott: prometheus-openstack-stale-puppet-certs: preserve original cert name [puppet] - https://gerrit.wikimedia.org/r/830704 [21:37:35] (CR) Andrew Bogott:
[C: +2] Rename 'Galera haproxy failover' alert [puppet] - https://gerrit.wikimedia.org/r/830646 (owner: Andrew Bogott) [21:41:25] SRE, ops-ulsfo, DC-Ops, Traffic: Q1:rack/setup/install cp - https://phabricator.wikimedia.org/T317244 (RobH) [21:41:40] SRE, ops-ulsfo, DC-Ops, Traffic: Q1:rack/setup/install cp - https://phabricator.wikimedia.org/T317244 (RobH) [21:47:42] SRE, ops-ulsfo, DC-Ops, Traffic: Q1:rack/setup/install cp - https://phabricator.wikimedia.org/T317244 (RobH) a:BBlack @bblack, I have all of these staged for racking onsite (basically stacked in the racks but not on rails.) I have a few pending questions for you on these: 1) Half of these... [21:54:15] SRE, ops-ulsfo, DC-Ops, Traffic: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (RobH) [21:56:20] (PS6) BCornwall: prometheus: Parse ATS config to node exporter text [puppet] - https://gerrit.wikimedia.org/r/826362 (https://phabricator.wikimedia.org/T292815) [21:56:22] (PS2) BCornwall: ats: Enable node_ats_config monitoring [puppet] - https://gerrit.wikimedia.org/r/830686 (https://phabricator.wikimedia.org/T292815) [21:56:23] SRE, ops-ulsfo, DC-Ops, Traffic: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (RobH) [21:56:57] (CR) CI reject: [V: -1] prometheus: Parse ATS config to node exporter text [puppet] - https://gerrit.wikimedia.org/r/826362 (https://phabricator.wikimedia.org/T292815) (owner: BCornwall) [21:59:54] (PS7) BCornwall: prometheus: Parse ATS config to node exporter text [puppet] - https://gerrit.wikimedia.org/r/826362 (https://phabricator.wikimedia.org/T292815) [21:59:56] (PS3) BCornwall: ats: Enable node_ats_config monitoring [puppet] - https://gerrit.wikimedia.org/r/830686 (https://phabricator.wikimedia.org/T292815) [22:03:35] SRE, ops-ulsfo, DC-Ops, Traffic: ulsfo refresh scheduling -
https://phabricator.wikimedia.org/T317249 (RobH) p:Triage→Medium [22:07:37] SRE, ops-ulsfo, DC-Ops, Traffic: ulsfo refresh scheduling - https://phabricator.wikimedia.org/T317249 (RobH) a:RobH→BBlack @bblack, The current question on T317244, is can I decom cp4021 and replace it with new cp4037 for testing? If so, is cp4037 to be a single or dual NVMe host? Once... [22:07:48] SRE, ops-ulsfo, DC-Ops, Traffic: ulsfo refresh scheduling - https://phabricator.wikimedia.org/T317249 (RobH) [22:12:56] !log Attempting to migrate all remaining Striker managed git repos from Diffusion to GitLab (T315706) [22:13:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:13:02] T315706: Migrate existing Striker created Diffusion repos to GitLab - https://phabricator.wikimedia.org/T315706 [22:17:33] ah that'll be why I was just added to a GitLab repo [22:17:36] :p [22:20:33] TheresNoTime: :) I imagine a number of folks are getting emails from gitlab right about now [22:44:39] PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [22:55:59] RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [23:02:21] PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:19:21] PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [23:26:06] SRE, Infrastructure-Foundations, Patch-For-Review, cloud-services-team (Kanban): toolforge/paws k8s containers need to know about clouddumps100[12] - https://phabricator.wikimedia.org/T317144 (bd808) >>! In T317144#8218632, @Andrew wrote: > That all sounds correct to me, although I'm not clear on... [23:34:29] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [23:36:07] RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [23:40:38] SRE, ops-codfw, Observability-Logging: Degraded RAID on logstash2027 - https://phabricator.wikimedia.org/T316996 (Papaul) ` Create Dispatch: Success You have successfully submitted request SR151072219. [23:49:27] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS6939/IPv6: Connect - HE, AS6939/IPv4: Connect - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [23:52:39] (PS1) Arlolra: Disable wgParserEnableLegacyMediaDOM on several wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/830707 (https://phabricator.wikimedia.org/T314318)