[00:00:06] RECOVERY - Check systemd state on puppetmaster2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:06:10] PROBLEM - Check systemd state on puppetmaster2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump_cloud_ip_ranges.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:06:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315 (T323214)', diff saved to https://phabricator.wikimedia.org/P40405 and previous config saved to /var/cache/conftool/dbconfig/20221122-000638-ladsgroup.json [00:06:40] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2157.codfw.wmnet with reason: Maintenance [00:06:45] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [00:06:54] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2157.codfw.wmnet with reason: Maintenance [00:07:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2157 (T323214)', diff saved to https://phabricator.wikimedia.org/P40406 and previous config saved to /var/cache/conftool/dbconfig/20221122-000700-ladsgroup.json [00:07:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P40407 and previous config saved to /var/cache/conftool/dbconfig/20221122-000739-ladsgroup.json [00:09:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P40408 and previous config saved to /var/cache/conftool/dbconfig/20221122-000904-ladsgroup.json [00:19:27] (PS3) BCornwall: node: Exclude trafficserver promfile mtime check [alerts] - https://gerrit.wikimedia.org/r/858658 [00:21:32] PROBLEM - PHD should be running on phab1001 is CRITICAL: PROCS 
CRITICAL: 0 processes with regex args php ./phd-daemon, UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [00:22:01] (ProbeDown) firing: (2) Service phab1001:443 has failed probes (http_phabricator_wikimedia_org_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#phab1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:22:44] hello [00:22:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T322618)', diff saved to https://phabricator.wikimedia.org/P40409 and previous config saved to /var/cache/conftool/dbconfig/20221122-002245-ladsgroup.json [00:22:47] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [00:22:52] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [00:23:01] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [00:23:40] mutante: phab1001 is the old host, right? so that alert is bogus? 
[00:24:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T322618)', diff saved to https://phabricator.wikimedia.org/P40410 and previous config saved to /var/cache/conftool/dbconfig/20221122-002411-ladsgroup.json [00:24:13] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [00:24:15] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [00:28:52] oh I see, it's an expired silence [00:31:05] !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 14 days, 0:00:00 on phab1001.eqiad.wmnet with reason: T322250 [00:31:11] T322250: decom phab2001 (service owner) - https://phabricator.wikimedia.org/T322250 [00:31:20] rzl: yes, sorry. fixed. I thought 2 hours was plenty. it was not [00:31:21] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on phab1001.eqiad.wmnet with reason: T322250 [00:31:23] was just about to ask if I can do that, thanks :) [00:31:45] no worries! 
thanks for the work [00:32:10] resolving in VO [00:33:04] thanks [00:41:34] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 227, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:41:52] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 90, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:47:14] PROBLEM - Check systemd state on logstash1026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:08:40] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [01:13:36] !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on phab1004.eqiad.wmnet with reason: T322250 [01:13:42] T322250: decom phab2001 (service owner) - https://phabricator.wikimedia.org/T322250 [01:13:51] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on phab1004.eqiad.wmnet with reason: T322250 [01:14:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T323214)', diff saved to https://phabricator.wikimedia.org/P40411 and previous config saved to /var/cache/conftool/dbconfig/20221122-011404-ladsgroup.json [01:14:10] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [01:16:21] (PS1) Dzahn: Revert "phabricator: switch from phab1001 to phab1004, discovery and SPF" [dns] - https://gerrit.wikimedia.org/r/859077 [01:16:56] (PS1) Dzahn: Revert "hieradata: switch active Phabricator 
server to phab1004" [puppet] - https://gerrit.wikimedia.org/r/859078 [01:19:59] (KubernetesAPILatency) firing: (13) High Kubernetes API latency (PATCH events) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [01:24:58] (KubernetesAPILatency) firing: (13) High Kubernetes API latency (PATCH events) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [01:25:03] !log reverting to phab1001; short phabricator downtime incoming while DNS changes are made (T280597) [01:25:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:25:09] T280597: move phabricator to new hardware generation - https://phabricator.wikimedia.org/T280597 [01:26:45] !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on phab1004.eqiad.wmnet with reason: T322250 [01:26:48] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on phab1004.eqiad.wmnet with reason: T322250 [01:27:00] (CR) Dzahn: [C: +2] Revert "phabricator: switch from phab1001 to phab1004, discovery and SPF" [dns] - https://gerrit.wikimedia.org/r/859077 (owner: Dzahn) [01:27:21] (CR) Dzahn: [C: +2] Revert "hieradata: switch active Phabricator server to phab1004" [puppet] - https://gerrit.wikimedia.org/r/859078 (owner: Dzahn) [01:28:52] !log brennen@deploy1002 Started deploy [phabricator/deployment@f68dc24]: deploy config changes for phab1004 -> phab1001 revert [01:29:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to and previous config saved to /var/cache/conftool/dbconfig/20221122-012910-ladsgroup.json [01:29:49] !log brennen@deploy1002 Finished deploy [phabricator/deployment@f68dc24]: deploy config changes for phab1004 -> phab1001 revert (duration: 00m 56s) [01:35:22] RECOVERY - PHD should be running on 
phab1001 is OK: PROCS OK: 1 process with regex args php ./phd-daemon, UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [01:37:45] (JobUnavailable) firing: (5) Reduced availability for job redis_gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:42:45] (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:44:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to and previous config saved to /var/cache/conftool/dbconfig/20221122-014417-ladsgroup.json [01:51:03] we had to revert. for now phab1001 is the prod server again despite earlier comments [01:52:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:55:54] !log dzahn@cumin2002 START - Cookbook sre.hosts.remove-downtime for phab1001.eqiad.wmnet [01:55:54] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for phab1001.eqiad.wmnet [01:56:16] !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 14 days, 0:00:00 on phab1004.eqiad.wmnet with reason: T322250 [01:56:20] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on phab1004.eqiad.wmnet with reason: T322250 [01:56:21] T322250: decom phab2001 (service owner) - https://phabricator.wikimedia.org/T322250 [01:59:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance 
db1144:3315 (T323214)', diff saved to https://phabricator.wikimedia.org/P40412 and previous config saved to /var/cache/conftool/dbconfig/20221122-015923-ladsgroup.json [01:59:25] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1150.eqiad.wmnet with reason: Maintenance [01:59:29] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [01:59:49] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1150.eqiad.wmnet with reason: Maintenance [02:06:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T323214)', diff saved to https://phabricator.wikimedia.org/P40413 and previous config saved to /var/cache/conftool/dbconfig/20221122-020628-ladsgroup.json [02:06:34] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [02:07:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:17:45] (JobUnavailable) resolved: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:21:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P40414 and previous config saved to /var/cache/conftool/dbconfig/20221122-022134-ladsgroup.json [02:23:55] (NodeTextfileStale) firing: (48) Stale textfile for cp1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - 
https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [02:24:54] (NodeTextfileStale) firing: (48) Stale textfile for cp1076:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [02:36:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P40415 and previous config saved to /var/cache/conftool/dbconfig/20221122-023641-ladsgroup.json [02:51:16] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [02:51:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T323214)', diff saved to https://phabricator.wikimedia.org/P40416 and previous config saved to /var/cache/conftool/dbconfig/20221122-025148-ladsgroup.json [02:51:50] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2171.codfw.wmnet with reason: Maintenance [02:51:54] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [02:52:03] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2171.codfw.wmnet with reason: Maintenance [02:52:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2171:3315 (T323214)', diff saved to https://phabricator.wikimedia.org/P40417 and previous config saved to /var/cache/conftool/dbconfig/20221122-025209-ladsgroup.json [02:52:24] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [02:55:12] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48976 bytes in 3.633 
second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [02:56:16] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.377 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221122T0300) [03:49:49] (PS1) KartikMistry: Make Western Frisian Wikipedia Machine Translation stricter by 10% [mediawiki-config] - https://gerrit.wikimedia.org/r/859161 (https://phabricator.wikimedia.org/T323415) [04:00:04] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221122T0400) [04:00:38] PROBLEM - Router interfaces on cr3-esams is CRITICAL: CRITICAL: host 91.198.174.245, interfaces up: 83, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:01:16] PROBLEM - Router interfaces on cr2-drmrs is CRITICAL: CRITICAL: host 185.15.58.129, interfaces up: 60, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:03:42] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1161.eqiad.wmnet with reason: Maintenance [04:04:06] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1161.eqiad.wmnet with reason: Maintenance [04:04:08] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [04:04:23] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [04:04:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1161 (T323214)', diff saved to https://phabricator.wikimedia.org/P40418 and previous config saved to /var/cache/conftool/dbconfig/20221122-040429-ladsgroup.json [04:04:35] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [04:54:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315 (T323214)', diff saved to https://phabricator.wikimedia.org/P40419 and previous config saved to /var/cache/conftool/dbconfig/20221122-045406-ladsgroup.json [04:54:12] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [05:07:04] PROBLEM - SSH on mw1327.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:08:40] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [05:09:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315', diff saved to https://phabricator.wikimedia.org/P40420 and previous config saved to /var/cache/conftool/dbconfig/20221122-050912-ladsgroup.json [05:24:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315', diff saved to https://phabricator.wikimedia.org/P40421 and previous config saved to /var/cache/conftool/dbconfig/20221122-052419-ladsgroup.json [05:25:14] (KubernetesAPILatency) firing: (12) High Kubernetes API latency (LIST authorizationpolicies) on k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency 
[05:39:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315 (T323214)', diff saved to https://phabricator.wikimedia.org/P40422 and previous config saved to /var/cache/conftool/dbconfig/20221122-053925-ladsgroup.json [05:39:28] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2178.codfw.wmnet with reason: Maintenance [05:39:32] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [05:39:41] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2178.codfw.wmnet with reason: Maintenance [05:39:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2178 (T323214)', diff saved to https://phabricator.wikimedia.org/P40423 and previous config saved to /var/cache/conftool/dbconfig/20221122-053947-ladsgroup.json [06:03:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T323214)', diff saved to https://phabricator.wikimedia.org/P40424 and previous config saved to /var/cache/conftool/dbconfig/20221122-060315-ladsgroup.json [06:03:21] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [06:18:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P40425 and previous config saved to /var/cache/conftool/dbconfig/20221122-061821-ladsgroup.json [06:23:54] (NodeTextfileStale) firing: (48) Stale textfile for cp1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [06:24:54] (NodeTextfileStale) firing: (48) Stale textfile for cp1076:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - 
https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [06:26:10] PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:33:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P40426 and previous config saved to /var/cache/conftool/dbconfig/20221122-063328-ladsgroup.json [06:44:46] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:44:59] (KubernetesAPILatency) firing: (13) High Kubernetes API latency (LIST nodes) on aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [06:48:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T323214)', diff saved to https://phabricator.wikimedia.org/P40427 and previous config saved to /var/cache/conftool/dbconfig/20221122-064834-ladsgroup.json [06:48:36] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1185.eqiad.wmnet with reason: Maintenance [06:48:41] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [06:48:50] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1185.eqiad.wmnet with reason: Maintenance [06:48:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1185 (T323214)', diff saved to https://phabricator.wikimedia.org/P40428 and previous config saved to /var/cache/conftool/dbconfig/20221122-064856-ladsgroup.json [06:50:24] !log ladsgroup@cumin1001 START - Cookbook 
sre.hosts.downtime for 1:00:00 on 27 hosts with reason: Primary switchover s2 T323116 [06:50:29] T323116: Switchover s2 master (db1122 -> db1162) - https://phabricator.wikimedia.org/T323116 [06:50:54] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 27 hosts with reason: Primary switchover s2 T323116 [06:52:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Set db1162 with weight 0 T323116', diff saved to https://phabricator.wikimedia.org/P40429 and previous config saved to /var/cache/conftool/dbconfig/20221122-065219-ladsgroup.json [06:56:44] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:00:05] kormat, marostegui, and Amir1: Your horoscope predicts another unfortunate Primary database switchover deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221122T0700). [07:00:23] need a couple of minutes to finish the topology move [07:08:35] (PS1) Gerrit maintenance bot: mariadb: Promote db1157 to s3 master [puppet] - https://gerrit.wikimedia.org/r/858380 (https://phabricator.wikimedia.org/T323546) [07:08:38] (PS1) Gerrit maintenance bot: wmnet: Update s3-master alias [dns] - https://gerrit.wikimedia.org/r/858381 (https://phabricator.wikimedia.org/T323546) [07:08:56] RECOVERY - SSH on mw1327.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:09:06] (PS1) Gerrit maintenance bot: mariadb: Promote db1118 to s1 master [puppet] - https://gerrit.wikimedia.org/r/858382 (https://phabricator.wikimedia.org/T323547) [07:09:10] (PS1) Gerrit maintenance bot: wmnet: Update s1-master alias [dns] - https://gerrit.wikimedia.org/r/858383 (https://phabricator.wikimedia.org/T323547) [07:12:33] (PS2) Giuseppe Lavagetto: scap: add mw on k8s dsh list [puppet] - 
https://gerrit.wikimedia.org/r/858988 (https://phabricator.wikimedia.org/T323349) [07:12:46] done now [07:13:40] (PS2) Giuseppe Lavagetto: blubberoid: convert to modules [deployment-charts] - https://gerrit.wikimedia.org/r/858206 [07:13:42] (PS2) Ladsgroup: mariadb: Promote db1162 to s2 master [puppet] - https://gerrit.wikimedia.org/r/856496 (https://phabricator.wikimedia.org/T323116) (owner: Gerrit maintenance bot) [07:13:45] (CR) Ladsgroup: [V: +2 C: +2] mariadb: Promote db1162 to s2 master [puppet] - https://gerrit.wikimedia.org/r/856496 (https://phabricator.wikimedia.org/T323116) (owner: Gerrit maintenance bot) [07:14:13] (CR) Giuseppe Lavagetto: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38374/console" [puppet] - https://gerrit.wikimedia.org/r/858988 (https://phabricator.wikimedia.org/T323349) (owner: Giuseppe Lavagetto) [07:14:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T323214)', diff saved to https://phabricator.wikimedia.org/P40430 and previous config saved to /var/cache/conftool/dbconfig/20221122-071442-ladsgroup.json [07:14:49] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [07:16:48] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:17:03] !log Starting s2 eqiad failover from db1122 to db1162 - T323116 [07:17:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:17:08] T323116: Switchover s2 master (db1122 -> db1162) - https://phabricator.wikimedia.org/T323116 [07:17:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Set s2 eqiad as read-only for maintenance - T323116', diff saved to https://phabricator.wikimedia.org/P40431 and previous config saved to 
/var/cache/conftool/dbconfig/20221122-071727-ladsgroup.json [07:17:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Promote db1162 to s2 primary and set section read-write T323116', diff saved to https://phabricator.wikimedia.org/P40432 and previous config saved to /var/cache/conftool/dbconfig/20221122-071759-ladsgroup.json [07:21:25] (PS2) Ladsgroup: wmnet: Update s2-master alias [dns] - https://gerrit.wikimedia.org/r/856497 (https://phabricator.wikimedia.org/T323116) (owner: Gerrit maintenance bot) [07:21:34] (CR) Ladsgroup: [V: +2 C: +2] wmnet: Update s2-master alias [dns] - https://gerrit.wikimedia.org/r/856497 (https://phabricator.wikimedia.org/T323116) (owner: Gerrit maintenance bot) [07:22:24] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1096.eqiad.wmnet with reason: Maintenance [07:22:27] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1096.eqiad.wmnet with reason: Maintenance [07:22:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1096:3315 (T321126)', diff saved to https://phabricator.wikimedia.org/P40433 and previous config saved to /var/cache/conftool/dbconfig/20221122-072233-marostegui.json [07:22:39] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [07:23:51] (CR) Giuseppe Lavagetto: [C: +2] blubberoid: convert to modules [deployment-charts] - https://gerrit.wikimedia.org/r/858206 (owner: Giuseppe Lavagetto) [07:25:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315 (T321126)', diff saved to https://phabricator.wikimedia.org/P40434 and previous config saved to /var/cache/conftool/dbconfig/20221122-072505-marostegui.json [07:28:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depool db1122 T323116', diff saved to https://phabricator.wikimedia.org/P40435 and previous config saved to 
/var/cache/conftool/dbconfig/20221122-072802-ladsgroup.json [07:28:08] T323116: Switchover s2 master (db1122 -> db1162) - https://phabricator.wikimedia.org/T323116 [07:28:20] RECOVERY - Router interfaces on cr3-esams is OK: OK: host 91.198.174.245, interfaces up: 84, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:28:59] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1099.eqiad.wmnet with reason: Maintenance [07:29:00] (Merged) jenkins-bot: blubberoid: convert to modules [deployment-charts] - https://gerrit.wikimedia.org/r/858206 (owner: Giuseppe Lavagetto) [07:29:04] RECOVERY - Router interfaces on cr2-drmrs is OK: OK: host 185.15.58.129, interfaces up: 61, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:29:12] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1099.eqiad.wmnet with reason: Maintenance [07:29:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1099:3311 (T321130)', diff saved to https://phabricator.wikimedia.org/P40436 and previous config saved to /var/cache/conftool/dbconfig/20221122-072918-marostegui.json [07:29:24] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [07:29:30] SRE, ops-codfw, DBA: db2174 lost power - https://phabricator.wikimedia.org/T323512 (Marostegui) >>! In T323512#8410369, @jcrespo wrote: > @Papaul, I wonder if we could do a "simple" test of checking the power supply redundancy by "pulling the plug" (literally or just pushing the on/off button) to che... 
[07:29:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P40437 and previous config saved to /var/cache/conftool/dbconfig/20221122-072949-ladsgroup.json [07:30:03] !log oblivian@deploy1002 helmfile [staging] START helmfile.d/services/blubberoid: apply [07:32:26] (PS1) Marostegui: db2174: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/859328 (https://phabricator.wikimedia.org/T323512) [07:33:03] (CR) Marostegui: [C: +2] db2174: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/859328 (https://phabricator.wikimedia.org/T323512) (owner: Marostegui) [07:33:51] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1122.eqiad.wmnet with reason: Maintenance [07:33:54] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1122.eqiad.wmnet with reason: Maintenance [07:39:47] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1122.eqiad.wmnet with reason: Maintenance [07:39:50] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1122.eqiad.wmnet with reason: Maintenance [07:40:06] !log oblivian@deploy1002 helmfile [staging] DONE helmfile.d/services/blubberoid: apply [07:40:09] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dse-k8s-etcd1003.eqiad.wmnet with reason: rack move of ganeti1012 [07:40:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315', diff saved to https://phabricator.wikimedia.org/P40438 and previous config saved to /var/cache/conftool/dbconfig/20221122-074011-marostegui.json [07:40:24] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dse-k8s-etcd1003.eqiad.wmnet with reason: rack move of ganeti1012 [07:40:35] !log jmm@cumin2002 START - Cookbook 
sre.hosts.downtime for 1 day, 0:00:00 on kubetcd1004.eqiad.wmnet with reason: rack move of ganeti1012 [07:40:51] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on kubetcd1004.eqiad.wmnet with reason: rack move of ganeti1012 [07:41:00] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on ml-etcd1002.eqiad.wmnet with reason: rack move of ganeti1012 [07:41:16] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ml-etcd1002.eqiad.wmnet with reason: rack move of ganeti1012 [07:42:22] SRE, ops-eqiad, DBA: eqiad: move non WMCS servers out of rack C8 - https://phabricator.wikimedia.org/T308339 (MoritzMuehlenhoff) ganeti1012 can be powered down for the rack move; the remaining three VMs are redundant and have been silenced in monitoring. [07:43:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311 (T321130)', diff saved to https://phabricator.wikimedia.org/P40439 and previous config saved to /var/cache/conftool/dbconfig/20221122-074323-marostegui.json [07:43:29] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [07:44:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T323214)', diff saved to https://phabricator.wikimedia.org/P40440 and previous config saved to /var/cache/conftool/dbconfig/20221122-074400-ladsgroup.json [07:44:06] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [07:44:35] !log oblivian@deploy1002 helmfile [staging] START helmfile.d/services/blubberoid: apply [07:44:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P40441 and previous config saved to /var/cache/conftool/dbconfig/20221122-074455-ladsgroup.json [07:49:48] (PS1) Giuseppe Lavagetto: 
deployment_server::k8s: add new data structure for modules [puppet] - 10https://gerrit.wikimedia.org/r/859430 [07:50:03] (03CR) 10Giuseppe Lavagetto: [C: 03+2] deployment_server::k8s: add new data structure for modules [puppet] - 10https://gerrit.wikimedia.org/r/859430 (owner: 10Giuseppe Lavagetto) [07:51:30] (03CR) 10Muehlenhoff: [C: 03+2] Retire two k8s Partman recipes [puppet] - 10https://gerrit.wikimedia.org/r/858587 (https://phabricator.wikimedia.org/T156955) (owner: 10Muehlenhoff) [07:52:18] (03CR) 10Muehlenhoff: [C: 03+2] Remove dumpsdata100XH750.cfg partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/858589 (owner: 10Muehlenhoff) [07:54:39] !log oblivian@deploy1002 helmfile [staging] DONE helmfile.d/services/blubberoid: apply [07:55:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315', diff saved to https://phabricator.wikimedia.org/P40442 and previous config saved to /var/cache/conftool/dbconfig/20221122-075518-marostegui.json [07:56:41] 10SRE, 10observability, 10serviceops, 10Patch-For-Review, and 2 others: Create an alert for high memcached bw usage - https://phabricator.wikimedia.org/T224454 (10elukey) An optional (but in my opinion useful) alert could be related to a prolonged usage of the gutter pool, that is not something we wish for... 
[07:57:43] 10SRE, 10LDAP-Access-Requests: Grant Access to ciadmin for Dom Walden - https://phabricator.wikimedia.org/T323549 (10dom_walden) [07:58:20] (03PS1) 10Muehlenhoff: Retire obsolete cloudvirt Partman recipes [puppet] - 10https://gerrit.wikimedia.org/r/859431 (https://phabricator.wikimedia.org/T156955) [07:58:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', diff saved to https://phabricator.wikimedia.org/P40443 and previous config saved to /var/cache/conftool/dbconfig/20221122-075829-marostegui.json [07:58:34] !log oblivian@deploy1002 helmfile [staging] START helmfile.d/services/blubberoid: apply [07:58:44] !log oblivian@deploy1002 helmfile [staging] DONE helmfile.d/services/blubberoid: apply [07:59:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P40444 and previous config saved to /var/cache/conftool/dbconfig/20221122-075907-ladsgroup.json [07:59:24] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/blubberoid: apply [08:00:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T323214)', diff saved to https://phabricator.wikimedia.org/P40445 and previous config saved to /var/cache/conftool/dbconfig/20221122-080002-ladsgroup.json [08:00:05] Amir1 and Urbanecm: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221122T0800). [08:00:05] No Gerrit patches in the queue for this window AFAICS. 
[08:00:08] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [08:00:38] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/blubberoid: apply [08:08:30] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/blubberoid: apply [08:09:51] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/blubberoid: apply [08:10:10] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [08:10:23] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [08:10:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315 (T321126)', diff saved to https://phabricator.wikimedia.org/P40446 and previous config saved to /var/cache/conftool/dbconfig/20221122-081024-marostegui.json [08:10:26] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1100.eqiad.wmnet with reason: Maintenance [08:10:29] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1100.eqiad.wmnet with reason: Maintenance [08:10:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3317 (T322618)', diff saved to https://phabricator.wikimedia.org/P40447 and previous config saved to /var/cache/conftool/dbconfig/20221122-081029-ladsgroup.json [08:10:32] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [08:10:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1100 (T321126)', diff saved to https://phabricator.wikimedia.org/P40448 and previous config saved to /var/cache/conftool/dbconfig/20221122-081035-marostegui.json [08:10:45] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [08:12:39] (03CR) 10Filippo 
Giunchedi: [C: 03+1] "\o/" [puppet] - 10https://gerrit.wikimedia.org/r/859431 (https://phabricator.wikimedia.org/T156955) (owner: 10Muehlenhoff) [08:12:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T322618)', diff saved to https://phabricator.wikimedia.org/P40449 and previous config saved to /var/cache/conftool/dbconfig/20221122-081239-ladsgroup.json [08:13:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1100 (T321126)', diff saved to https://phabricator.wikimedia.org/P40450 and previous config saved to /var/cache/conftool/dbconfig/20221122-081307-marostegui.json [08:13:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', diff saved to https://phabricator.wikimedia.org/P40451 and previous config saved to /var/cache/conftool/dbconfig/20221122-081336-marostegui.json [08:14:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P40452 and previous config saved to /var/cache/conftool/dbconfig/20221122-081413-ladsgroup.json [08:15:54] (03CR) 10Filippo Giunchedi: [C: 03+1] node: Exclude trafficserver promfile mtime check [alerts] - 10https://gerrit.wikimedia.org/r/858658 (owner: 10BCornwall) [08:19:45] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2098.codfw.wmnet with reason: Maintenance [08:19:58] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2098.codfw.wmnet with reason: Maintenance [08:20:09] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2100.codfw.wmnet with reason: Maintenance [08:20:23] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2100.codfw.wmnet with reason: Maintenance [08:20:37] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2108.codfw.wmnet with reason: 
Maintenance [08:20:51] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2108.codfw.wmnet with reason: Maintenance [08:20:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2108 (T322618)', diff saved to https://phabricator.wikimedia.org/P40453 and previous config saved to /var/cache/conftool/dbconfig/20221122-082057-ladsgroup.json [08:21:02] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [08:23:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108 (T322618)', diff saved to https://phabricator.wikimedia.org/P40454 and previous config saved to /var/cache/conftool/dbconfig/20221122-082314-ladsgroup.json [08:27:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P40455 and previous config saved to /var/cache/conftool/dbconfig/20221122-082746-ladsgroup.json [08:28:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1100', diff saved to https://phabricator.wikimedia.org/P40456 and previous config saved to /var/cache/conftool/dbconfig/20221122-082813-marostegui.json [08:28:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311 (T321130)', diff saved to https://phabricator.wikimedia.org/P40457 and previous config saved to /var/cache/conftool/dbconfig/20221122-082842-marostegui.json [08:28:44] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1105.eqiad.wmnet with reason: Maintenance [08:28:48] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [08:28:58] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1105.eqiad.wmnet with reason: Maintenance [08:29:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling 
db1105:3311 (T321130)', diff saved to https://phabricator.wikimedia.org/P40458 and previous config saved to /var/cache/conftool/dbconfig/20221122-082904-marostegui.json [08:29:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T323214)', diff saved to https://phabricator.wikimedia.org/P40459 and previous config saved to /var/cache/conftool/dbconfig/20221122-082920-ladsgroup.json [08:29:22] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1200.eqiad.wmnet with reason: Maintenance [08:29:25] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [08:29:57] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1200.eqiad.wmnet with reason: Maintenance [08:30:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1200 (T323214)', diff saved to https://phabricator.wikimedia.org/P40460 and previous config saved to /var/cache/conftool/dbconfig/20221122-083003-ladsgroup.json [08:38:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108', diff saved to https://phabricator.wikimedia.org/P40461 and previous config saved to /var/cache/conftool/dbconfig/20221122-083820-ladsgroup.json [08:42:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P40462 and previous config saved to /var/cache/conftool/dbconfig/20221122-084252-ladsgroup.json [08:43:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1100', diff saved to https://phabricator.wikimedia.org/P40463 and previous config saved to /var/cache/conftool/dbconfig/20221122-084320-marostegui.json [08:43:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311 (T321130)', diff saved to https://phabricator.wikimedia.org/P40464 and previous config saved to 
/var/cache/conftool/dbconfig/20221122-084326-marostegui.json [08:43:32] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [08:53:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108', diff saved to https://phabricator.wikimedia.org/P40465 and previous config saved to /var/cache/conftool/dbconfig/20221122-085327-ladsgroup.json [08:57:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T322618)', diff saved to https://phabricator.wikimedia.org/P40466 and previous config saved to /var/cache/conftool/dbconfig/20221122-085758-ladsgroup.json [08:58:00] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1101.eqiad.wmnet with reason: Maintenance [08:58:05] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [08:58:14] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1101.eqiad.wmnet with reason: Maintenance [08:58:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1101:3317 (T322618)', diff saved to https://phabricator.wikimedia.org/P40467 and previous config saved to /var/cache/conftool/dbconfig/20221122-085820-ladsgroup.json [08:58:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1100 (T321126)', diff saved to https://phabricator.wikimedia.org/P40468 and previous config saved to /var/cache/conftool/dbconfig/20221122-085826-marostegui.json [08:58:28] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1110.eqiad.wmnet with reason: Maintenance [08:58:31] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1110.eqiad.wmnet with reason: Maintenance [08:58:32] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 
[08:58:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311', diff saved to https://phabricator.wikimedia.org/P40469 and previous config saved to /var/cache/conftool/dbconfig/20221122-085832-marostegui.json [08:58:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1110 (T321126)', diff saved to https://phabricator.wikimedia.org/P40470 and previous config saved to /var/cache/conftool/dbconfig/20221122-085843-marostegui.json [08:59:35] jouncebot: next [08:59:35] In 5 hour(s) and 0 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221122T1400) [08:59:35] In 5 hour(s) and 0 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221122T1400) [08:59:45] (03CR) 10Filippo Giunchedi: [C: 03+2] graphite: start mirroring traffic to graphite2004 [puppet] - 10https://gerrit.wikimedia.org/r/858610 (https://phabricator.wikimedia.org/T315524) (owner: 10Filippo Giunchedi) [09:00:24] !log jmm@cumin1001 START - Cookbook sre.hosts.reboot-single for host cumin2002.codfw.wmnet [09:00:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T322618)', diff saved to https://phabricator.wikimedia.org/P40471 and previous config saved to /var/cache/conftool/dbconfig/20221122-090030-ladsgroup.json [09:01:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T321126)', diff saved to https://phabricator.wikimedia.org/P40472 and previous config saved to /var/cache/conftool/dbconfig/20221122-090115-marostegui.json [09:07:28] (03CR) 10Giuseppe Lavagetto: [C: 03+2] apertium: convert to modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/856945 (owner: 10Giuseppe Lavagetto) [09:08:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108 (T322618)', diff saved to https://phabricator.wikimedia.org/P40473 and previous config 
saved to /var/cache/conftool/dbconfig/20221122-090833-ladsgroup.json [09:08:35] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2120.codfw.wmnet with reason: Maintenance [09:08:39] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [09:08:40] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [09:08:48] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2120.codfw.wmnet with reason: Maintenance [09:08:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2120 (T322618)', diff saved to https://phabricator.wikimedia.org/P40474 and previous config saved to /var/cache/conftool/dbconfig/20221122-090854-ladsgroup.json [09:10:49] 10SRE, 10Infrastructure-Foundations: Implement email address validation workflow - https://phabricator.wikimedia.org/T320808 (10SLyngshede-WMF) 05Open→03In progress [09:10:51] 10SRE, 10Infrastructure-Foundations: IDM milestone 2 "Initial limited deployment" - https://phabricator.wikimedia.org/T320603 (10SLyngshede-WMF) [09:11:02] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:11:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120 (T322618)', diff saved to https://phabricator.wikimedia.org/P40475 and previous config saved to /var/cache/conftool/dbconfig/20221122-091112-ladsgroup.json [09:11:31] !log jmm@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host cumin2002.codfw.wmnet [09:12:28] (03Merged) 10jenkins-bot: apertium: convert to modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/856945 
(owner: 10Giuseppe Lavagetto) [09:12:41] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudvirt1050: move to modern NIC setup [puppet] - 10https://gerrit.wikimedia.org/r/859095 (https://phabricator.wikimedia.org/T319184) (owner: 10Arturo Borrero Gonzalez) [09:13:02] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:13:15] 10SRE, 10ops-codfw, 10DBA: db2174 lost power - https://phabricator.wikimedia.org/T323512 (10jcrespo) a:03Papaul [09:13:20] !log aborrero@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1050.eqiad.wmnet with OS bullseye [09:13:30] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review, 10cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin1001 for host cloudvirt1050.eqiad.wmnet with O... 
[09:13:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311', diff saved to https://phabricator.wikimedia.org/P40476 and previous config saved to /var/cache/conftool/dbconfig/20221122-091339-marostegui.json [09:15:33] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [09:15:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P40477 and previous config saved to /var/cache/conftool/dbconfig/20221122-091537-ladsgroup.json [09:16:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P40478 and previous config saved to /var/cache/conftool/dbconfig/20221122-091621-marostegui.json [09:16:25] (03PS1) 10Arturo Borrero Gonzalez: cloudvirt1049: move to modern NIC setup [puppet] - 10https://gerrit.wikimedia.org/r/859435 (https://phabricator.wikimedia.org/T319184) [09:16:56] (03CR) 10Vgutierrez: [C: 03+1] node: Exclude trafficserver promfile mtime check (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/858658 (owner: 10BCornwall) [09:17:00] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review, 10cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (10aborrero) [09:18:09] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM. You may want to collect +1 from Andrew as well to be on the safe side." 
[puppet] - 10https://gerrit.wikimedia.org/r/859431 (https://phabricator.wikimedia.org/T156955) (owner: 10Muehlenhoff) [09:18:45] !log oblivian@deploy1002 helmfile [staging] START helmfile.d/services/apertium: apply [09:19:51] !log oblivian@deploy1002 helmfile [staging] DONE helmfile.d/services/apertium: apply [09:20:00] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [09:20:33] (KeyholderUnarmed) resolved: 1 unarmed Keyholder key(s) on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [09:21:04] PROBLEM - SSH on mw1329.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:22:11] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/apertium: apply [09:22:22] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/apertium: apply [09:23:50] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/apertium: apply [09:24:44] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/apertium: apply [09:25:16] !log failover Ganeti master in eqiad to ganeti1028 T311687 [09:25:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:21] T311687: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 [09:26:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120', diff saved to https://phabricator.wikimedia.org/P40479 and previous config saved to /var/cache/conftool/dbconfig/20221122-092618-ladsgroup.json [09:27:28] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1050.eqiad.wmnet with reason: host reimage [09:28:46] !log marostegui@cumin1001 dbctl commit (dc=all): 
'Repooling after maintenance db1105:3311 (T321130)', diff saved to https://phabricator.wikimedia.org/P40480 and previous config saved to /var/cache/conftool/dbconfig/20221122-092845-marostegui.json [09:28:51] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [09:30:18] PROBLEM - ganeti-wconfd running on ganeti1027 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [09:30:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P40481 and previous config saved to /var/cache/conftool/dbconfig/20221122-093044-ladsgroup.json [09:31:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P40482 and previous config saved to /var/cache/conftool/dbconfig/20221122-093128-marostegui.json [09:31:56] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1050.eqiad.wmnet with reason: host reimage [09:33:41] PROBLEM - graphite.wikimedia.org requires authentication on graphite2004 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 200 OK https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [09:34:22] uh ^? [09:34:47] it seems it is not routable from public network, but still not great [09:35:09] well.. 
graphite.wikimedia.org is a public endpoint [09:35:20] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1106.eqiad.wmnet with reason: Maintenance [09:35:33] yeah, but the public endpoint doesn't point there I think [09:35:33] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1106.eqiad.wmnet with reason: Maintenance [09:35:35] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [09:35:46] I am checking recent commits [09:35:49] that's me, apologies for the spam [09:35:50] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [09:35:52] graphite2004 is a new host [09:35:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1106 (T321130)', diff saved to https://phabricator.wikimedia.org/P40483 and previous config saved to /var/cache/conftool/dbconfig/20221122-093556-marostegui.json [09:35:58] I'll silence it [09:36:00] godog: ack [09:36:02] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [09:36:36] !log filippo@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on graphite2004.codfw.wmnet with reason: setup [09:36:42] done ^ [09:36:50] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on graphite2004.codfw.wmnet with reason: setup [09:36:50] vgutierrez: what I meant is apache was configured for the public endpoint but it was not reachable through it [09:38:53] PROBLEM - Ganeti memory on ganeti1015 is CRITICAL: CRIT Memory 95% used. 
Largest process: qemu-system-x86 (4131886) = 12.7% https://wikitech.wikimedia.org/wiki/Ganeti%23Memory_pressure [09:40:08] ^ moritzm there was an increase in memory utilization, if you are still reimaging those they may need a rebalance afterwards [09:40:29] https://grafana.wikimedia.org/goto/w8X-ZbO4k?orgId=1 [09:40:44] yeah, that's known, I'm currently reshuffling VMs for reboots [09:40:55] no worries then [09:41:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120', diff saved to https://phabricator.wikimedia.org/P40484 and previous config saved to /var/cache/conftool/dbconfig/20221122-094125-ladsgroup.json [09:41:45] hopefully we can move mailman to a dedicated host to free some resources there soon [09:44:21] in general we have enough headroom, it's just temporary spikes during reimages/reboots since the cluster isn't rebalanced after every reboot/reimage, but rather when the entire work is completed [09:45:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T322618)', diff saved to https://phabricator.wikimedia.org/P40485 and previous config saved to /var/cache/conftool/dbconfig/20221122-094550-ladsgroup.json [09:45:52] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1127.eqiad.wmnet with reason: Maintenance [09:45:56] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [09:46:05] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1127.eqiad.wmnet with reason: Maintenance [09:46:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1127 (T322618)', diff saved to https://phabricator.wikimedia.org/P40486 and previous config saved to /var/cache/conftool/dbconfig/20221122-094611-ladsgroup.json [09:46:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T321126)', diff saved to 
https://phabricator.wikimedia.org/P40487 and previous config saved to /var/cache/conftool/dbconfig/20221122-094635-marostegui.json [09:46:37] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1113.eqiad.wmnet with reason: Maintenance [09:46:39] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1113.eqiad.wmnet with reason: Maintenance [09:46:40] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [09:46:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3315 (T321126)', diff saved to https://phabricator.wikimedia.org/P40488 and previous config saved to /var/cache/conftool/dbconfig/20221122-094645-marostegui.json [09:47:05] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1017.eqiad.wmnet [09:47:44] 10SRE, 10ops-codfw, 10DBA: db2173 HW errors - https://phabricator.wikimedia.org/T322988 (10Marostegui) [09:48:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T321126)', diff saved to https://phabricator.wikimedia.org/P40489 and previous config saved to /var/cache/conftool/dbconfig/20221122-094817-marostegui.json [09:48:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T322618)', diff saved to https://phabricator.wikimedia.org/P40490 and previous config saved to /var/cache/conftool/dbconfig/20221122-094821-ladsgroup.json [09:48:39] 10SRE, 10ops-codfw, 10DBA: db2173 HW errors - https://phabricator.wikimedia.org/T322988 (10Marostegui) Once this host is back we need to make sure we apply {T321130} (enwiki) [09:48:46] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/apertium: apply [09:49:36] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/apertium: apply [09:50:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T321130)', 
diff saved to https://phabricator.wikimedia.org/P40491 and previous config saved to /var/cache/conftool/dbconfig/20221122-095003-marostegui.json [09:50:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T323214)', diff saved to https://phabricator.wikimedia.org/P40492 and previous config saved to /var/cache/conftool/dbconfig/20221122-095008-ladsgroup.json [09:50:09] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [09:50:14] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [09:51:20] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1017.eqiad.wmnet [09:56:19] (03CR) 10Jcrespo: "Answer:" [software/transferpy] - 10https://gerrit.wikimedia.org/r/859047 (https://phabricator.wikimedia.org/T323485) (owner: 10Jcrespo) [09:56:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120 (T322618)', diff saved to https://phabricator.wikimedia.org/P40493 and previous config saved to /var/cache/conftool/dbconfig/20221122-095631-ladsgroup.json [09:56:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2121.codfw.wmnet with reason: Maintenance [09:56:38] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [09:56:46] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2121.codfw.wmnet with reason: Maintenance [09:56:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2121 (T322618)', diff saved to https://phabricator.wikimedia.org/P40494 and previous config saved to /var/cache/conftool/dbconfig/20221122-095652-ladsgroup.json [09:58:22] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1050.eqiad.wmnet with OS bullseye [09:58:32] 10SRE, 
10Infrastructure-Foundations, 10netops, 10Patch-For-Review, 10cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin1001 for host cloudvirt1050.eqiad.wmnet with OS bu... [09:59:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121 (T322618)', diff saved to https://phabricator.wikimedia.org/P40495 and previous config saved to /var/cache/conftool/dbconfig/20221122-095910-ladsgroup.json [10:01:50] (03PS1) 10Jcrespo: Update changelog for release 1.1 [software/transferpy] - 10https://gerrit.wikimedia.org/r/859446 (https://phabricator.wikimedia.org/T323485) [10:02:41] (03PS1) 10Hashar: Add CI results to a tab [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/859083 [10:02:57] (03Abandoned) 10Hashar: Add CI results to a tab [software/gerrit] (deploy/wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/856182 (owner: 10Hashar) [10:03:01] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:03:05] (03CR) 10CI reject: [V: 04-1] Add CI results to a tab [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/859083 (owner: 10Hashar) [10:03:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P40496 and previous config saved to /var/cache/conftool/dbconfig/20221122-100323-marostegui.json [10:03:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P40497 and previous config saved to /var/cache/conftool/dbconfig/20221122-100328-ladsgroup.json [10:05:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to 
https://phabricator.wikimedia.org/P40498 and previous config saved to /var/cache/conftool/dbconfig/20221122-100509-marostegui.json [10:05:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P40499 and previous config saved to /var/cache/conftool/dbconfig/20221122-100515-ladsgroup.json [10:06:49] (03CR) 10Giuseppe Lavagetto: [V: 03+1 C: 03+2] scap: add mw on k8s dsh list [puppet] - 10https://gerrit.wikimedia.org/r/858988 (https://phabricator.wikimedia.org/T323349) (owner: 10Giuseppe Lavagetto) [10:09:56] !log start backfilling data into graphite2004 - T315524 [10:10:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:10:02] T315524: Put graphite2004 in service - https://phabricator.wikimedia.org/T315524 [10:10:22] 10SRE, 10Traffic-Icebox: Let ats-tls handle port 80 - https://phabricator.wikimedia.org/T254235 (10Vgutierrez) 05Open→03Invalid ats-tls has been deprecated in favor of HAProxy [10:12:15] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin2002 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [10:12:19] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on aux-k8s-etcd1002.eqiad.wmnet with reason: ganeti reboot [10:12:35] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on aux-k8s-etcd1002.eqiad.wmnet with reason: ganeti reboot [10:13:28] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1026.eqiad.wmnet [10:13:38] (03PS3) 10Giuseppe Lavagetto: api-gateway: convert to modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/856950 [10:14:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121', diff saved to https://phabricator.wikimedia.org/P40500 and previous config saved to 
/var/cache/conftool/dbconfig/20221122-101417-ladsgroup.json [10:15:55] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance [10:16:09] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance [10:16:10] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [10:16:15] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [10:16:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1167 (T321312)', diff saved to https://phabricator.wikimedia.org/P40501 and previous config saved to /var/cache/conftool/dbconfig/20221122-101620-ladsgroup.json [10:18:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P40502 and previous config saved to /var/cache/conftool/dbconfig/20221122-101829-marostegui.json [10:18:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P40503 and previous config saved to /var/cache/conftool/dbconfig/20221122-101834-ladsgroup.json [10:18:51] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1026.eqiad.wmnet [10:20:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P40504 and previous config saved to /var/cache/conftool/dbconfig/20221122-102016-marostegui.json [10:20:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P40505 and previous config saved to 
/var/cache/conftool/dbconfig/20221122-102021-ladsgroup.json [10:21:39] RECOVERY - SSH on mw1329.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:23:55] (NodeTextfileStale) firing: (48) Stale textfile for cp1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [10:24:55] (NodeTextfileStale) firing: (48) Stale textfile for cp1076:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [10:25:28] (03CR) 10Volans: [C: 03+1] "LGTM" [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/769729 (https://phabricator.wikimedia.org/T303529) (owner: 10Cathal Mooney) [10:25:49] (03PS11) 10Cathal Mooney: Modify wmf-netbox plugin to provide QFX5120-48Y port block speeds [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/769729 (https://phabricator.wikimedia.org/T303529) [10:26:13] (03CR) 10Cathal Mooney: [C: 03+2] Modify wmf-netbox plugin to provide QFX5120-48Y port block speeds [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/769729 (https://phabricator.wikimedia.org/T303529) (owner: 10Cathal Mooney) [10:27:24] (03CR) 10Cathal Mooney: [V: 03+2 C: 03+2] Modify wmf-netbox plugin to provide QFX5120-48Y port block speeds [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/769729 (https://phabricator.wikimedia.org/T303529) (owner: 10Cathal Mooney) [10:28:09] !log jnuche@deploy1002 Started scap: testing k8s deploys [10:29:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121', diff saved to https://phabricator.wikimedia.org/P40506 and previous config saved to 
/var/cache/conftool/dbconfig/20221122-102923-ladsgroup.json [10:29:54] (03CR) 10Btullis: Add a spark-operator chart and helmfile configuraiton (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/855674 (https://phabricator.wikimedia.org/T318926) (owner: 10Btullis) [10:30:38] 10SRE-swift-storage, 10Infrastructure-Foundations, 10Patch-For-Review: unstable device mapping of SSDs causing installer problems - example reimage with destruction of swift filesystem - https://phabricator.wikimedia.org/T308677 (10jbond) >>! In T308677#8290550, @jbond wrote: > just putting a note here. aft... [10:31:13] (03CR) 10Muehlenhoff: [C: 03+2] alertmanager: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/858603 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [10:31:52] (03PS1) 10Vgutierrez: orchestrator: Reject non-tls requests with a 403 [puppet] - 10https://gerrit.wikimedia.org/r/859449 (https://phabricator.wikimedia.org/T238720) [10:33:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T321126)', diff saved to https://phabricator.wikimedia.org/P40507 and previous config saved to /var/cache/conftool/dbconfig/20221122-103336-marostegui.json [10:33:38] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1144.eqiad.wmnet with reason: Maintenance [10:33:40] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1144.eqiad.wmnet with reason: Maintenance [10:33:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T322618)', diff saved to https://phabricator.wikimedia.org/P40508 and previous config saved to /var/cache/conftool/dbconfig/20221122-103341-ladsgroup.json [10:33:42] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [10:33:43] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on 
db1136.eqiad.wmnet with reason: Maintenance [10:33:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3315 (T321126)', diff saved to https://phabricator.wikimedia.org/P40509 and previous config saved to /var/cache/conftool/dbconfig/20221122-103346-marostegui.json [10:33:55] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [10:33:56] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1136.eqiad.wmnet with reason: Maintenance [10:34:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1136 (T322618)', diff saved to https://phabricator.wikimedia.org/P40510 and previous config saved to /var/cache/conftool/dbconfig/20221122-103402-ladsgroup.json [10:34:19] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38375/console" [puppet] - 10https://gerrit.wikimedia.org/r/859449 (https://phabricator.wikimedia.org/T238720) (owner: 10Vgutierrez) [10:34:48] (03PS1) 10Arturo Borrero Gonzalez: openstack: cloudvirt: unset_maintenance: clarify SAL message [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/859451 [10:34:49] !log jnuche@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [10:34:50] !log jnuche@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [10:35:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T321130)', diff saved to https://phabricator.wikimedia.org/P40511 and previous config saved to /var/cache/conftool/dbconfig/20221122-103522-marostegui.json [10:35:24] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1107.eqiad.wmnet with reason: Maintenance [10:35:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T323214)', diff saved to https://phabricator.wikimedia.org/P40512 and previous config saved to 
/var/cache/conftool/dbconfig/20221122-103527-ladsgroup.json [10:35:28] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [10:35:29] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [10:35:35] <_joe_> claime: uhm did we forget to merge the change to helmfile.yaml, did we? [10:35:37] T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214 [10:35:38] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1107.eqiad.wmnet with reason: Maintenance [10:35:43] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [10:35:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1107 (T321130)', diff saved to https://phabricator.wikimedia.org/P40513 and previous config saved to /var/cache/conftool/dbconfig/20221122-103544-marostegui.json [10:35:55] _joe_: which one ? [10:36:10] The do not log? 
[10:36:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136 (T322618)', diff saved to https://phabricator.wikimedia.org/P40514 and previous config saved to /var/cache/conftool/dbconfig/20221122-103612-ladsgroup.json [10:36:15] It should have been [10:36:15] <_joe_> claime: yes [10:36:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T321126)', diff saved to https://phabricator.wikimedia.org/P40515 and previous config saved to /var/cache/conftool/dbconfig/20221122-103618-marostegui.json [10:37:03] https://gitlab.wikimedia.org/repos/releng/scap/-/commit/716d9b6cde07d14381c305cfaef9876bdf10ab5b [10:37:36] _joe_: ^ [10:38:23] <_joe_> claime: yeah but you also needed to change all the helmfiles right [10:38:29] <_joe_> else they'd run the hooks [10:38:51] _joe_: No, calling them with the environment variable set was enough when I tested manually [10:38:58] !log jnuche@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [10:39:09] !log jnuche@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [10:40:25] _joe_: Like `SUPPRESS_SAL=true helmfile -e eqiad -i apply` worked, so there may be something I'm missing [10:40:40] (03CR) 10Vgutierrez: [C: 03+1] confd: create /var/run/confd-template [puppet] - 10https://gerrit.wikimedia.org/r/859102 (https://phabricator.wikimedia.org/T321678) (owner: 10Filippo Giunchedi) [10:43:28] (03CR) 10Filippo Giunchedi: [C: 03+2] confd: create /var/run/confd-template [puppet] - 10https://gerrit.wikimedia.org/r/859102 (https://phabricator.wikimedia.org/T321678) (owner: 10Filippo Giunchedi) [10:43:31] !log jnuche@deploy1002 helmfile [codfw] START helmfile.d/services/mw-web: apply [10:43:31] !log jnuche@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [10:43:31] !log jnuche@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [10:43:31] !log jnuche@deploy1002 helmfile [codfw] START 
helmfile.d/services/mw-api-ext: apply [10:43:31] !log jnuche@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply [10:43:31] !log jnuche@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [10:43:31] !log jnuche@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [10:43:32] !log jnuche@deploy1002 helmfile [codfw] START helmfile.d/services/mw-jobrunner: apply [10:43:52] (03CR) 10Marostegui: [C: 03+1] orchestrator: Reject non-tls requests with a 403 [puppet] - 10https://gerrit.wikimedia.org/r/859449 (https://phabricator.wikimedia.org/T238720) (owner: 10Vgutierrez) [10:43:52] !log jnuche@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [10:44:01] !log jnuche@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-jobrunner: apply [10:44:02] !log jnuche@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply [10:44:21] !log jnuche@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply [10:44:27] !log jnuche@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [10:44:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121 (T322618)', diff saved to https://phabricator.wikimedia.org/P40516 and previous config saved to /var/cache/conftool/dbconfig/20221122-104429-ladsgroup.json [10:44:32] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2122.codfw.wmnet with reason: Maintenance [10:44:32] !log jnuche@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [10:44:35] !log jnuche@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [10:44:36] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [10:44:42] !log jnuche@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [10:44:42] (03PS1) 10Slyngshede: Configuration: Add support for setting connection timeout. 
[software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/859453 [10:44:43] !log jnuche@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [10:44:43] !log jnuche@deploy1002 helmfile [codfw] START helmfile.d/services/mw-web: apply [10:44:43] !log jnuche@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [10:44:43] !log jnuche@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [10:44:43] !log jnuche@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [10:44:43] !log jnuche@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply [10:44:43] !log jnuche@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [10:44:44] !log jnuche@deploy1002 helmfile [codfw] START helmfile.d/services/mw-jobrunner: apply [10:44:45] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2122.codfw.wmnet with reason: Maintenance [10:44:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2122 (T322618)', diff saved to https://phabricator.wikimedia.org/P40517 and previous config saved to /var/cache/conftool/dbconfig/20221122-104451-ladsgroup.json [10:44:56] Sorry for the flood. 
[10:45:13] !log jnuche@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply [10:45:14] (KubernetesAPILatency) firing: (12) High Kubernetes API latency (LIST authorizationpolicies) on k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:45:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T321312)', diff saved to https://phabricator.wikimedia.org/P40518 and previous config saved to /var/cache/conftool/dbconfig/20221122-104534-ladsgroup.json [10:45:36] !log jnuche@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-jobrunner: apply [10:45:38] !log jnuche@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [10:45:38] !log jnuche@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply [10:45:39] (03CR) 10Jcrespo: [C: 03+1] orchestrator: Reject non-tls requests with a 403 [puppet] - 10https://gerrit.wikimedia.org/r/859449 (https://phabricator.wikimedia.org/T238720) (owner: 10Vgutierrez) [10:45:44] !log jnuche@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [10:45:44] !log jnuche@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [10:46:02] !log jnuche@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [10:46:03] (03CR) 10Ladsgroup: [C: 03+1] orchestrator: Reject non-tls requests with a 403 [puppet] - 10https://gerrit.wikimedia.org/r/859449 (https://phabricator.wikimedia.org/T238720) (owner: 10Vgutierrez) [10:46:07] !log jnuche@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [10:47:04] 10SRE, 10Traffic, 10Patch-For-Review: Deprecate and disable port 80 for one-off sites under canonical domains - https://phabricator.wikimedia.org/T238720 (10Vgutierrez) [10:47:08] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] orchestrator: Reject non-tls requests with a 403 [puppet] - 10https://gerrit.wikimedia.org/r/859449 
(https://phabricator.wikimedia.org/T238720) (owner: 10Vgutierrez) [10:47:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122 (T322618)', diff saved to https://phabricator.wikimedia.org/P40519 and previous config saved to /var/cache/conftool/dbconfig/20221122-104708-ladsgroup.json [10:49:16] !log jnuche@deploy1002 Finished scap: testing k8s deploys (duration: 21m 06s) [10:50:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1107 (T321130)', diff saved to https://phabricator.wikimedia.org/P40520 and previous config saved to /var/cache/conftool/dbconfig/20221122-105021-marostegui.json [10:50:27] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [10:50:30] (03PS1) 10Jcrespo: Add man page for tranfer.py executable [software/transferpy] - 10https://gerrit.wikimedia.org/r/859455 [10:50:49] (03PS2) 10Jcrespo: Add man page for transfer.py executable [software/transferpy] - 10https://gerrit.wikimedia.org/r/859455 [10:51:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136', diff saved to https://phabricator.wikimedia.org/P40521 and previous config saved to /var/cache/conftool/dbconfig/20221122-105118-ladsgroup.json [10:51:23] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1029.eqiad.wmnet [10:51:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P40522 and previous config saved to /var/cache/conftool/dbconfig/20221122-105124-marostegui.json [10:55:51] (03PS1) 10Vgutierrez: icinga: Reject non-tls requests with a 403 [puppet] - 10https://gerrit.wikimedia.org/r/859457 (https://phabricator.wikimedia.org/T238720) [10:58:28] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1029.eqiad.wmnet [10:58:59] PROBLEM - MegaRAID on an-worker1090 is CRITICAL: CRITICAL: 13 
LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [10:59:49] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38376/console" [puppet] - 10https://gerrit.wikimedia.org/r/859457 (https://phabricator.wikimedia.org/T238720) (owner: 10Vgutierrez) [11:00:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P40523 and previous config saved to /var/cache/conftool/dbconfig/20221122-110040-ladsgroup.json [11:00:58] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/859435 (https://phabricator.wikimedia.org/T319184) (owner: 10Arturo Borrero Gonzalez) [11:01:05] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudvirt1049: move to modern NIC setup [puppet] - 10https://gerrit.wikimedia.org/r/859435 (https://phabricator.wikimedia.org/T319184) (owner: 10Arturo Borrero Gonzalez) [11:01:39] !log aborrero@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1049.eqiad.wmnet with OS bullseye [11:01:49] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review, 10cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin1001 for host cloudvirt1049.eqiad.wmnet with O... 
[11:02:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122', diff saved to https://phabricator.wikimedia.org/P40524 and previous config saved to /var/cache/conftool/dbconfig/20221122-110214-ladsgroup.json [11:03:50] 10SRE-swift-storage, 10Infrastructure-Foundations, 10Patch-For-Review: unstable device mapping of SSDs causing installer problems - example reimage with destruction of swift filesystem - https://phabricator.wikimedia.org/T308677 (10MatthewVernon) Bother :-/ [11:04:34] (03CR) 10DCausse: [WIP] flink and flink-kubernetes-operator image (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/858356 (https://phabricator.wikimedia.org/T316519) (owner: 10Ottomata) [11:05:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1107', diff saved to https://phabricator.wikimedia.org/P40525 and previous config saved to /var/cache/conftool/dbconfig/20221122-110528-marostegui.json [11:05:29] !log stevemunene@deploy1002 Started deploy [analytics/turnilo/deploy@51da050]: (no justification provided) [11:06:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136', diff saved to https://phabricator.wikimedia.org/P40526 and previous config saved to /var/cache/conftool/dbconfig/20221122-110625-ladsgroup.json [11:06:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P40527 and previous config saved to /var/cache/conftool/dbconfig/20221122-110631-marostegui.json [11:07:41] !log stevemunene@deploy1002 Finished deploy [analytics/turnilo/deploy@51da050]: (no justification provided) (duration: 02m 12s) [11:07:55] (03PS2) 10Arturo Borrero Gonzalez: cookbooks: wmcs: cloudvirt: add cookbook to maintain canary VMs [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/859114 [11:08:17] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review, 
10cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (10aborrero) [11:08:30] (03CR) 10Hnowlan: [C: 03+2] thumbor: fix metrics prefix [deployment-charts] - 10https://gerrit.wikimedia.org/r/859106 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan) [11:08:41] (03CR) 10Hnowlan: thumbor: fix metrics prefix [deployment-charts] - 10https://gerrit.wikimedia.org/r/859106 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan) [11:08:46] (03CR) 10Jbond: profile::kafka::broker: add pki_intermediate_name parameter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/757800 (https://phabricator.wikimedia.org/T291905) (owner: 10Elukey) [11:09:10] (03PS2) 10Hnowlan: thumbor: fix metrics prefix [deployment-charts] - 10https://gerrit.wikimedia.org/r/859106 (https://phabricator.wikimedia.org/T233196) [11:09:57] RECOVERY - MegaRAID on an-worker1090 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [11:10:43] !log installing gnutls28 security updates [11:10:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:02] (03CR) 10Jbond: [C: 03+1] "ack makes sense" [puppet] - 10https://gerrit.wikimedia.org/r/859112 (https://phabricator.wikimedia.org/T319163) (owner: 10Filippo Giunchedi) [11:11:42] (03CR) 10Filippo Giunchedi: [C: 03+1] icinga: Reject non-tls requests with a 403 [puppet] - 10https://gerrit.wikimedia.org/r/859457 (https://phabricator.wikimedia.org/T238720) (owner: 10Vgutierrez) [11:12:44] (03CR) 10Filippo Giunchedi: [C: 03+2] pontoon: write out puppet/pki CA certs [puppet] - 10https://gerrit.wikimedia.org/r/859112 (https://phabricator.wikimedia.org/T319163) (owner: 10Filippo Giunchedi) [11:13:33] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:13:40] 
(03CR) 10CI reject: [V: 04-1] thumbor: fix metrics prefix [deployment-charts] - 10https://gerrit.wikimedia.org/r/859106 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan) [11:14:14] (03PS1) 10Ilias Sarantopoulos: ml-services: Update docker images to use single model server [deployment-charts] - 10https://gerrit.wikimedia.org/r/859461 (https://phabricator.wikimedia.org/T320374) [11:15:35] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1049.eqiad.wmnet with reason: host reimage [11:15:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P40528 and previous config saved to /var/cache/conftool/dbconfig/20221122-111547-ladsgroup.json [11:15:48] (03PS3) 10Hnowlan: thumbor: fix metrics prefix [deployment-charts] - 10https://gerrit.wikimedia.org/r/859106 (https://phabricator.wikimedia.org/T233196) [11:16:12] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1030.eqiad.wmnet [11:17:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122', diff saved to https://phabricator.wikimedia.org/P40529 and previous config saved to /var/cache/conftool/dbconfig/20221122-111721-ladsgroup.json [11:18:11] 10SRE, 10Traffic, 10Patch-For-Review: Deprecate and disable port 80 for one-off sites under canonical domains - https://phabricator.wikimedia.org/T238720 (10Vgutierrez) [11:18:31] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1049.eqiad.wmnet with reason: host reimage [11:20:32] (03PS1) 10Giuseppe Lavagetto: mediawiki-image-download: don't exit 1 if no images to remove [puppet] - 10https://gerrit.wikimedia.org/r/859465 [11:20:34] (03PS1) 10Giuseppe Lavagetto: scap::dsh: support querying puppetdb, use for k8s-workers [puppet] - 10https://gerrit.wikimedia.org/r/859466 [11:20:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after 
maintenance db1107', diff saved to https://phabricator.wikimedia.org/P40530 and previous config saved to /var/cache/conftool/dbconfig/20221122-112034-marostegui.json [11:20:56] 10SRE, 10Traffic, 10Patch-For-Review: Deprecate and disable port 80 for one-off sites under canonical domains - https://phabricator.wikimedia.org/T238720 (10Vgutierrez) [11:21:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136 (T322618)', diff saved to https://phabricator.wikimedia.org/P40531 and previous config saved to /var/cache/conftool/dbconfig/20221122-112131-ladsgroup.json [11:21:37] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [11:21:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T321126)', diff saved to https://phabricator.wikimedia.org/P40532 and previous config saved to /var/cache/conftool/dbconfig/20221122-112137-marostegui.json [11:21:39] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1150.eqiad.wmnet with reason: Maintenance [11:21:42] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1150.eqiad.wmnet with reason: Maintenance [11:21:43] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [11:22:12] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38377/console" [puppet] - 10https://gerrit.wikimedia.org/r/859466 (owner: 10Giuseppe Lavagetto) [11:22:28] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1030.eqiad.wmnet [11:23:32] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki-image-download: don't exit 1 if no images to remove [puppet] - 10https://gerrit.wikimedia.org/r/859465 (owner: 10Giuseppe Lavagetto) [11:26:03] (03PS1) 10Vgutierrez: dumps: 
Reject non-tls requests with a 403 [puppet] - 10https://gerrit.wikimedia.org/r/859467 (https://phabricator.wikimedia.org/T238720) [11:28:01] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38378/console" [puppet] - 10https://gerrit.wikimedia.org/r/859467 (https://phabricator.wikimedia.org/T238720) (owner: 10Vgutierrez) [11:28:07] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1158.eqiad.wmnet with reason: Maintenance [11:28:20] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1158.eqiad.wmnet with reason: Maintenance [11:28:22] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [11:28:37] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [11:28:41] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1161.eqiad.wmnet with reason: Maintenance [11:28:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1158 (T322618)', diff saved to https://phabricator.wikimedia.org/P40533 and previous config saved to /var/cache/conftool/dbconfig/20221122-112843-ladsgroup.json [11:28:44] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1161.eqiad.wmnet with reason: Maintenance [11:28:46] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [11:28:50] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [11:28:50] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - 
https://phabricator.wikimedia.org/T322618 [11:28:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1161 (T321126)', diff saved to https://phabricator.wikimedia.org/P40534 and previous config saved to /var/cache/conftool/dbconfig/20221122-112856-marostegui.json [11:29:03] (03PS1) 10Jbond: convrt-sssd: simplify logic [cookbooks] - 10https://gerrit.wikimedia.org/r/859470 [11:29:06] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [11:30:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T322618)', diff saved to https://phabricator.wikimedia.org/P40535 and previous config saved to /var/cache/conftool/dbconfig/20221122-113053-ladsgroup.json [11:30:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T321312)', diff saved to https://phabricator.wikimedia.org/P40536 and previous config saved to /var/cache/conftool/dbconfig/20221122-113053-ladsgroup.json [11:31:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T321126)', diff saved to https://phabricator.wikimedia.org/P40537 and previous config saved to /var/cache/conftool/dbconfig/20221122-113127-marostegui.json [11:32:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122 (T322618)', diff saved to https://phabricator.wikimedia.org/P40538 and previous config saved to /var/cache/conftool/dbconfig/20221122-113227-ladsgroup.json [11:32:29] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2150.codfw.wmnet with reason: Maintenance [11:32:43] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2150.codfw.wmnet with reason: Maintenance [11:32:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2150 (T322618)', diff saved to https://phabricator.wikimedia.org/P40539 and previous config saved to 
/var/cache/conftool/dbconfig/20221122-113249-ladsgroup.json [11:33:49] (03CR) 10CI reject: [V: 04-1] convrt-sssd: simplify logic [cookbooks] - 10https://gerrit.wikimedia.org/r/859470 (owner: 10Jbond) [11:35:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T322618)', diff saved to https://phabricator.wikimedia.org/P40540 and previous config saved to /var/cache/conftool/dbconfig/20221122-113506-ladsgroup.json [11:35:12] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [11:35:17] (03CR) 10AikoChou: ml-services: Update docker images to use single model server (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/859461 (https://phabricator.wikimedia.org/T320374) (owner: 10Ilias Sarantopoulos) [11:35:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1107 (T321130)', diff saved to https://phabricator.wikimedia.org/P40541 and previous config saved to /var/cache/conftool/dbconfig/20221122-113541-marostegui.json [11:35:43] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1118.eqiad.wmnet with reason: Maintenance [11:35:47] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [11:35:56] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1118.eqiad.wmnet with reason: Maintenance [11:36:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1118 (T321130)', diff saved to https://phabricator.wikimedia.org/P40542 and previous config saved to /var/cache/conftool/dbconfig/20221122-113602-marostegui.json [11:40:11] (03PS2) 10Giuseppe Lavagetto: scap::dsh: support querying puppetdb, use for k8s-workers [puppet] - 10https://gerrit.wikimedia.org/r/859466 [11:40:38] PROBLEM - MegaRAID on an-worker1090 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: 
WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [11:44:56] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1049.eqiad.wmnet with OS bullseye [11:44:59] (KubernetesAPILatency) firing: (13) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:45:07] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review, 10cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin1001 for host cloudvirt1049.eqiad.wmnet with OS bu... [11:45:18] (03CR) 10Giuseppe Lavagetto: [C: 03+2] scap::dsh: support querying puppetdb, use for k8s-workers [puppet] - 10https://gerrit.wikimedia.org/r/859466 (owner: 10Giuseppe Lavagetto) [11:46:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P40543 and previous config saved to /var/cache/conftool/dbconfig/20221122-114559-ladsgroup.json [11:46:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P40544 and previous config saved to /var/cache/conftool/dbconfig/20221122-114634-marostegui.json [11:49:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118 (T321130)', diff saved to https://phabricator.wikimedia.org/P40545 and previous config saved to /var/cache/conftool/dbconfig/20221122-114925-marostegui.json [11:49:31] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [11:49:59] (KubernetesAPILatency) firing: 
(13) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:50:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P40546 and previous config saved to /var/cache/conftool/dbconfig/20221122-115012-ladsgroup.json [11:52:59] ACKNOWLEDGEMENT - MegaRAID on an-worker1090 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough Btullis T318659 - Added more downtime, but replacement batteries are on their way https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [11:53:09] !log MAPS maintenance EQIAD: trigger full planet re-import for maps eqiad [11:53:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:34] 10SRE, 10Traffic, 10serviceops, 10Patch-For-Review: _etcd-client SRV record missing for conftool cluster - https://phabricator.wikimedia.org/T320397 (10Vgutierrez) 05Open→03Resolved a:03Joe ` vgutierrez@lvs6001:~$ ./liberica etcd --config /home/vgutierrez/config.yaml Using config file: /home/vgutier... [11:56:16] (03PS1) 10Stevemunene: Allow introspection for staging environment. [puppet] - 10https://gerrit.wikimedia.org/r/859479 (https://phabricator.wikimedia.org/T308778) [11:56:44] (03CR) 10Hnowlan: [C: 03+1] api-gateway: convert to modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/856950 (owner: 10Giuseppe Lavagetto) [11:56:55] !log MAPS maintenance EQIAD: trigger full planet re-import for maps eqiad - T314472 [11:56:56] T314472: Re-import full planet data into eqiad - https://phabricator.wikimedia.org/T314472 [11:58:25] (03CR) 10ArielGlenn: [C: 03+1] "Great idea!"
[puppet] - 10https://gerrit.wikimedia.org/r/859467 (https://phabricator.wikimedia.org/T238720) (owner: 10Vgutierrez) [11:58:39] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on aux-k8s-etcd1001.eqiad.wmnet with reason: ganeti reboot [11:58:47] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] dumps: Reject non-tls requests with a 403 [puppet] - 10https://gerrit.wikimedia.org/r/859467 (https://phabricator.wikimedia.org/T238720) (owner: 10Vgutierrez) [11:58:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on aux-k8s-etcd1001.eqiad.wmnet with reason: ganeti reboot [11:59:09] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on kubetcd1005.eqiad.wmnet with reason: ganeti reboot [11:59:25] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on kubetcd1005.eqiad.wmnet with reason: ganeti reboot [11:59:58] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1031.eqiad.wmnet [12:01:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P40547 and previous config saved to /var/cache/conftool/dbconfig/20221122-120106-ladsgroup.json [12:01:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P40548 and previous config saved to /var/cache/conftool/dbconfig/20221122-120140-marostegui.json [12:02:53] (03CR) 10Reedy: [C: 03+1] build: Update to PHPUnit 9.5 and declare php requirement [mediawiki-config] - 10https://gerrit.wikimedia.org/r/858441 (https://phabricator.wikimedia.org/T235142) (owner: 10Krinkle) [12:03:02] (03CR) 10Hnowlan: [C: 03+2] thumbor: fix metrics prefix [deployment-charts] - 10https://gerrit.wikimedia.org/r/859106 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan) [12:04:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118', 
diff saved to https://phabricator.wikimedia.org/P40549 and previous config saved to /var/cache/conftool/dbconfig/20221122-120431-marostegui.json [12:04:41] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1031.eqiad.wmnet [12:04:53] 10SRE, 10Traffic, 10Patch-For-Review: Deprecate and disable port 80 for one-off sites under canonical domains - https://phabricator.wikimedia.org/T238720 (10Vgutierrez) [12:04:59] (KubernetesAPILatency) firing: (13) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:05:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P40550 and previous config saved to /var/cache/conftool/dbconfig/20221122-120519-ladsgroup.json [12:08:08] (03Merged) 10jenkins-bot: thumbor: fix metrics prefix [deployment-charts] - 10https://gerrit.wikimedia.org/r/859106 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan) [12:10:34] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: sync [12:10:37] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: sync [12:11:49] (RdfStreamingUpdaterFlinkJobUnstable) firing: (2) WCQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [12:14:06] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: sync [12:14:22] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: sync [12:15:43] (03PS1) 10Arturo Borrero Gonzalez: cloudvirt1048: move to modern NIC setup [puppet] - 10https://gerrit.wikimedia.org/r/859481 (https://phabricator.wikimedia.org/T319184) 
[12:16:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T322618)', diff saved to https://phabricator.wikimedia.org/P40551 and previous config saved to /var/cache/conftool/dbconfig/20221122-121612-ladsgroup.json [12:16:14] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [12:16:18] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [12:16:28] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [12:16:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3317 (T322618)', diff saved to https://phabricator.wikimedia.org/P40552 and previous config saved to /var/cache/conftool/dbconfig/20221122-121633-ladsgroup.json [12:16:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T321126)', diff saved to https://phabricator.wikimedia.org/P40553 and previous config saved to /var/cache/conftool/dbconfig/20221122-121647-marostegui.json [12:16:49] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1200.eqiad.wmnet with reason: Maintenance [12:16:49] (RdfStreamingUpdaterFlinkJobUnstable) resolved: (2) WCQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [12:16:51] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1200.eqiad.wmnet with reason: Maintenance [12:16:52] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [12:16:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1200 (T321126)', 
diff saved to https://phabricator.wikimedia.org/P40554 and previous config saved to /var/cache/conftool/dbconfig/20221122-121657-marostegui.json [12:18:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T322618)', diff saved to https://phabricator.wikimedia.org/P40555 and previous config saved to /var/cache/conftool/dbconfig/20221122-121843-ladsgroup.json [12:18:49] (03PS1) 10Hnowlan: thumbor: correct haproxy metrics URI [deployment-charts] - 10https://gerrit.wikimedia.org/r/859482 [12:19:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T321126)', diff saved to https://phabricator.wikimedia.org/P40556 and previous config saved to /var/cache/conftool/dbconfig/20221122-121928-marostegui.json [12:19:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118', diff saved to https://phabricator.wikimedia.org/P40557 and previous config saved to /var/cache/conftool/dbconfig/20221122-121938-marostegui.json [12:20:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T322618)', diff saved to https://phabricator.wikimedia.org/P40558 and previous config saved to /var/cache/conftool/dbconfig/20221122-122025-ladsgroup.json [12:20:28] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2159.codfw.wmnet with reason: Maintenance [12:20:42] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2159.codfw.wmnet with reason: Maintenance [12:20:43] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2095.codfw.wmnet with reason: Maintenance [12:20:56] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2095.codfw.wmnet with reason: Maintenance [12:21:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2159 (T322618)', diff saved to https://phabricator.wikimedia.org/P40559 and previous 
config saved to /var/cache/conftool/dbconfig/20221122-122103-ladsgroup.json [12:22:33] RECOVERY - MegaRAID on an-worker1090 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [12:23:01] RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:23:09] (03CR) 10Cathal Mooney: [C: 03+1] cloudvirt1048: move to modern NIC setup [puppet] - 10https://gerrit.wikimedia.org/r/859481 (https://phabricator.wikimedia.org/T319184) (owner: 10Arturo Borrero Gonzalez) [12:23:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T322618)', diff saved to https://phabricator.wikimedia.org/P40560 and previous config saved to /var/cache/conftool/dbconfig/20221122-122320-ladsgroup.json [12:23:26] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [12:25:06] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudvirt1048: move to modern NIC setup [puppet] - 10https://gerrit.wikimedia.org/r/859481 (https://phabricator.wikimedia.org/T319184) (owner: 10Arturo Borrero Gonzalez) [12:25:34] (03PS5) 10Urbanecm: GrowthExperiments: Remove unused config variable GEMentorDashboardUseVue [mediawiki-config] - 10https://gerrit.wikimedia.org/r/856008 (owner: 10Sergio Gimeno) [12:25:43] !log aborrero@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1048.eqiad.wmnet with OS bullseye [12:25:46] (03PS1) 10Giuseppe Lavagetto: image-suggestion: convert to modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/859485 [12:25:48] (03PS1) 10Giuseppe Lavagetto: proton: convert to modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/859486 [12:25:50] (03PS1) 10Giuseppe Lavagetto: citoid: convert to modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/859487 [12:25:52] (03PS1) 10Giuseppe Lavagetto: 
cxserver: convert to modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/859488 [12:25:53] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review, 10cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin1001 for host cloudvirt1048.eqiad.wmnet with O... [12:26:01] (03PS1) 10Giuseppe Lavagetto: datahub: convert to modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/859489 [12:26:05] (03PS1) 10Giuseppe Lavagetto: developer-portal: convert to modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/859490 [12:27:06] (03CR) 10CI reject: [V: 04-1] image-suggestion: convert to modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/859485 (owner: 10Giuseppe Lavagetto) [12:27:08] (03CR) 10CI reject: [V: 04-1] proton: convert to modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/859486 (owner: 10Giuseppe Lavagetto) [12:27:13] (03CR) 10CI reject: [V: 04-1] citoid: convert to modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/859487 (owner: 10Giuseppe Lavagetto) [12:27:24] (03PS2) 10Jbond: convrt-sssd: simplify logic [cookbooks] - 10https://gerrit.wikimedia.org/r/859470 [12:27:33] (03CR) 10CI reject: [V: 04-1] cxserver: convert to modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/859488 (owner: 10Giuseppe Lavagetto) [12:27:41] (03CR) 10CI reject: [V: 04-1] datahub: convert to modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/859489 (owner: 10Giuseppe Lavagetto) [12:27:42] (03CR) 10CI reject: [V: 04-1] developer-portal: convert to modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/859490 (owner: 10Giuseppe Lavagetto) [12:28:43] (03PS3) 10Jbond: convrt-sssd: simplify logic [cookbooks] - 10https://gerrit.wikimedia.org/r/859470 [12:29:34] jouncebot: nowandnext [12:29:34] No deployments scheduled for the next 1 hour(s) and 30 minute(s) 
[12:29:34] In 1 hour(s) and 30 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221122T1400) [12:29:34] In 1 hour(s) and 30 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221122T1400) [12:29:59] (KubernetesAPILatency) firing: (13) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:30:31] (03PS1) 10Cathal Mooney: New release incorporating changes to the wmf-netbox plugin [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/859491 (https://phabricator.wikimedia.org/T312635) [12:31:36] (03CR) 10Btullis: [C: 03+1] "Looks good to me, thanks." [puppet] - 10https://gerrit.wikimedia.org/r/859479 (https://phabricator.wikimedia.org/T308778) (owner: 10Stevemunene) [12:33:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P40561 and previous config saved to /var/cache/conftool/dbconfig/20221122-123350-ladsgroup.json [12:33:51] (03CR) 10CI reject: [V: 04-1] convrt-sssd: simplify logic [cookbooks] - 10https://gerrit.wikimedia.org/r/859470 (owner: 10Jbond) [12:34:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P40562 and previous config saved to /var/cache/conftool/dbconfig/20221122-123435-marostegui.json [12:34:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118 (T321130)', diff saved to https://phabricator.wikimedia.org/P40563 and previous config saved to /var/cache/conftool/dbconfig/20221122-123444-marostegui.json [12:34:46] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1119.eqiad.wmnet with reason: Maintenance [12:34:50] T321130: Add column cuc_private to cu_changes on wmf wikis - 
https://phabricator.wikimedia.org/T321130 [12:35:00] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1119.eqiad.wmnet with reason: Maintenance [12:35:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1119 (T321130)', diff saved to https://phabricator.wikimedia.org/P40564 and previous config saved to /var/cache/conftool/dbconfig/20221122-123505-marostegui.json [12:36:49] !log jnuche@deploy1002 Installing scap version "4.29.1" for 559 hosts [12:37:20] !log jnuche@deploy1002 Installation of scap version "4.29.1" completed for 559 hosts [12:38:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P40565 and previous config saved to /var/cache/conftool/dbconfig/20221122-123827-ladsgroup.json [12:39:59] (KubernetesAPILatency) firing: (13) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:40:08] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1048.eqiad.wmnet with reason: host reimage [12:40:42] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:42:47] !log jnuche@deploy1002 Started scap: testing k8s deploys [12:43:42] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1048.eqiad.wmnet with reason: host reimage [12:44:59] (KubernetesAPILatency) firing: (13) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:45:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - 
https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:48:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119 (T321130)', diff saved to https://phabricator.wikimedia.org/P40567 and previous config saved to /var/cache/conftool/dbconfig/20221122-124818-marostegui.json [12:48:25] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [12:48:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P40568 and previous config saved to /var/cache/conftool/dbconfig/20221122-124856-ladsgroup.json [12:49:07] !log jnuche@deploy1002 Finished scap: testing k8s deploys (duration: 06m 20s) [12:49:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P40569 and previous config saved to /var/cache/conftool/dbconfig/20221122-124941-marostegui.json [12:50:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:53:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P40570 and previous config saved to /var/cache/conftool/dbconfig/20221122-125333-ladsgroup.json [12:57:24] 10SRE, 10Thumbor, 10Thumbor Migration, 10serviceops, and 2 others: Encoding issues when handling unicode characters in filenames - https://phabricator.wikimedia.org/T323114 (10hnowlan) 05Open→03Resolved [12:57:26] 10SRE, 10Thumbor, 
10Thumbor Migration, 10serviceops, and 2 others: Migrate thumbor to Kubernetes - https://phabricator.wikimedia.org/T233196 (10hnowlan) [13:00:01] (03CR) 10Hnowlan: [C: 03+2] thumbor: correct haproxy metrics URI [deployment-charts] - 10https://gerrit.wikimedia.org/r/859482 (owner: 10Hnowlan) [13:01:51] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:03:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119', diff saved to https://phabricator.wikimedia.org/P40571 and previous config saved to /var/cache/conftool/dbconfig/20221122-130325-marostegui.json [13:04:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T322618)', diff saved to https://phabricator.wikimedia.org/P40572 and previous config saved to /var/cache/conftool/dbconfig/20221122-130403-ladsgroup.json [13:04:05] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance [13:04:10] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [13:04:18] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance [13:04:23] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1174.eqiad.wmnet with reason: Maintenance [13:04:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1174.eqiad.wmnet with reason: Maintenance [13:04:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1174 (T322618)', diff saved to https://phabricator.wikimedia.org/P40573 and previous config saved to /var/cache/conftool/dbconfig/20221122-130442-ladsgroup.json [13:04:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling 
after maintenance db1200 (T321126)', diff saved to https://phabricator.wikimedia.org/P40574 and previous config saved to /var/cache/conftool/dbconfig/20221122-130447-marostegui.json [13:04:49] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [13:04:52] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [13:04:53] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [13:05:18] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2101.codfw.wmnet with reason: Maintenance [13:05:32] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2101.codfw.wmnet with reason: Maintenance [13:06:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T322618)', diff saved to https://phabricator.wikimedia.org/P40575 and previous config saved to /var/cache/conftool/dbconfig/20221122-130652-ladsgroup.json [13:06:53] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2111.codfw.wmnet with reason: Maintenance [13:06:55] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2111.codfw.wmnet with reason: Maintenance [13:07:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2111 (T321126)', diff saved to https://phabricator.wikimedia.org/P40576 and previous config saved to /var/cache/conftool/dbconfig/20221122-130701-marostegui.json [13:07:32] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1048.eqiad.wmnet with OS bullseye [13:07:37] (03Merged) 10jenkins-bot: thumbor: correct haproxy metrics URI [deployment-charts] - 10https://gerrit.wikimedia.org/r/859482 (owner: 10Hnowlan) [13:07:39] PROBLEM - SSH on db1122.mgmt 
is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:07:42] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review, 10cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin1001 for host cloudvirt1048.eqiad.wmnet with OS bu... [13:08:40] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [13:08:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T322618)', diff saved to https://phabricator.wikimedia.org/P40577 and previous config saved to /var/cache/conftool/dbconfig/20221122-130840-ladsgroup.json [13:08:42] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2168.codfw.wmnet with reason: Maintenance [13:08:55] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2168.codfw.wmnet with reason: Maintenance [13:09:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2168:3317 (T322618)', diff saved to https://phabricator.wikimedia.org/P40578 and previous config saved to /var/cache/conftool/dbconfig/20221122-130901-ladsgroup.json [13:09:32] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: sync [13:09:50] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: sync [13:10:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111 (T321126)', diff saved to https://phabricator.wikimedia.org/P40579 and previous config saved to /var/cache/conftool/dbconfig/20221122-131025-marostegui.json [13:10:31] T321126: Add column 'cul_actor' and index 
cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [13:11:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317 (T322618)', diff saved to https://phabricator.wikimedia.org/P40580 and previous config saved to /var/cache/conftool/dbconfig/20221122-131118-ladsgroup.json [13:11:24] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [13:14:59] (KubernetesAPILatency) firing: (13) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:18:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119', diff saved to https://phabricator.wikimedia.org/P40581 and previous config saved to /var/cache/conftool/dbconfig/20221122-131831-marostegui.json [13:21:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P40582 and previous config saved to /var/cache/conftool/dbconfig/20221122-132158-ladsgroup.json [13:25:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111', diff saved to https://phabricator.wikimedia.org/P40583 and previous config saved to /var/cache/conftool/dbconfig/20221122-132532-marostegui.json [13:26:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317', diff saved to https://phabricator.wikimedia.org/P40584 and previous config saved to /var/cache/conftool/dbconfig/20221122-132625-ladsgroup.json [13:26:40] (03CR) 10Stevemunene: [C: 03+2] Allow introspection for staging environment. 
[puppet] - 10https://gerrit.wikimedia.org/r/859479 (https://phabricator.wikimedia.org/T308778) (owner: 10Stevemunene) [13:28:03] (03PS1) 10David Caro: dumps: fix http alert to check the new status [puppet] - 10https://gerrit.wikimedia.org/r/859498 (https://phabricator.wikimedia.org/T238720) [13:28:13] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review, 10cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (10aborrero) [13:28:40] (03CR) 10CI reject: [V: 04-1] dumps: fix http alert to check the new status [puppet] - 10https://gerrit.wikimedia.org/r/859498 (https://phabricator.wikimedia.org/T238720) (owner: 10David Caro) [13:29:19] PROBLEM - Host ganeti1012 is DOWN: PING CRITICAL - Packet loss = 100% [13:30:03] (03PS3) 10Arturo Borrero Gonzalez: cookbooks: wmcs: cloudvirt: add cookbook to maintain canary VMs [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/859114 [13:31:37] PROBLEM - Host ganeti1012.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [13:32:45] (JobUnavailable) firing: Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:32:48] (03CR) 10Vgutierrez: [C: 03+1] dumps: fix http alert to check the new status [puppet] - 10https://gerrit.wikimedia.org/r/859498 (https://phabricator.wikimedia.org/T238720) (owner: 10David Caro) [13:33:39] (03PS2) 10David Caro: dumps: fix http alert to check the new status [puppet] - 10https://gerrit.wikimedia.org/r/859498 (https://phabricator.wikimedia.org/T238720) [13:33:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119 (T321130)', diff saved to https://phabricator.wikimedia.org/P40585 and previous config saved to /var/cache/conftool/dbconfig/20221122-133339-marostegui.json [13:33:41] !log marostegui@cumin1001 START - Cookbook 
sre.hosts.downtime for 5:00:00 on db1128.eqiad.wmnet with reason: Maintenance [13:33:45] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [13:33:55] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1128.eqiad.wmnet with reason: Maintenance [13:34:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1128 (T321130)', diff saved to https://phabricator.wikimedia.org/P40586 and previous config saved to /var/cache/conftool/dbconfig/20221122-133401-marostegui.json [13:34:44] (03CR) 10David Caro: [C: 03+2] dumps: fix http alert to check the new status [puppet] - 10https://gerrit.wikimedia.org/r/859498 (https://phabricator.wikimedia.org/T238720) (owner: 10David Caro) [13:34:59] (KubernetesAPILatency) firing: (13) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:37:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P40587 and previous config saved to /var/cache/conftool/dbconfig/20221122-133705-ladsgroup.json [13:37:43] RECOVERY - Host ganeti1012.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.81 ms [13:38:25] (03PS1) 10Elukey: team-sre: add druid alerts for webrequest_sampled_live [alerts] - 10https://gerrit.wikimedia.org/r/859502 (https://phabricator.wikimedia.org/T314981) [13:40:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111', diff saved to https://phabricator.wikimedia.org/P40588 and previous config saved to /var/cache/conftool/dbconfig/20221122-134038-marostegui.json [13:41:18] !log marostegui@cumin1001 START - Cookbook sre.wikireplicas.add-wiki [13:41:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317', diff saved to https://phabricator.wikimedia.org/P40589 and 
previous config saved to /var/cache/conftool/dbconfig/20221122-134131-ladsgroup.json [13:42:03] !log jclark@cumin1001 START - Cookbook sre.dns.netbox [13:43:24] !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:45:06] (03CR) 10Filippo Giunchedi: "LGTM! nit inline, looking good otherwise" [alerts] - 10https://gerrit.wikimedia.org/r/859502 (https://phabricator.wikimedia.org/T314981) (owner: 10Elukey) [13:46:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128 (T321130)', diff saved to https://phabricator.wikimedia.org/P40590 and previous config saved to /var/cache/conftool/dbconfig/20221122-134643-marostegui.json [13:46:49] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [13:48:54] (03PS2) 10Elukey: team-sre: add druid alerts for webrequest_sampled_live [alerts] - 10https://gerrit.wikimedia.org/r/859502 (https://phabricator.wikimedia.org/T314981) [13:49:07] (03CR) 10Elukey: team-sre: add druid alerts for webrequest_sampled_live (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/859502 (https://phabricator.wikimedia.org/T314981) (owner: 10Elukey) [13:52:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T322618)', diff saved to https://phabricator.wikimedia.org/P40591 and previous config saved to /var/cache/conftool/dbconfig/20221122-135211-ladsgroup.json [13:52:13] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1191.eqiad.wmnet with reason: Maintenance [13:52:17] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [13:52:26] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1191.eqiad.wmnet with reason: Maintenance [13:52:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1191 (T322618)', diff saved to https://phabricator.wikimedia.org/P40592 and 
previous config saved to /var/cache/conftool/dbconfig/20221122-135233-ladsgroup.json [13:54:34] (03PS2) 10Cathal Mooney: Release v0.6.1 update [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/859491 (https://phabricator.wikimedia.org/T312635) [13:54:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T322618)', diff saved to https://phabricator.wikimedia.org/P40593 and previous config saved to /var/cache/conftool/dbconfig/20221122-135442-ladsgroup.json [13:55:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111 (T321126)', diff saved to https://phabricator.wikimedia.org/P40594 and previous config saved to /var/cache/conftool/dbconfig/20221122-135545-marostegui.json [13:55:47] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2123.codfw.wmnet with reason: Maintenance [13:55:49] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2123.codfw.wmnet with reason: Maintenance [13:55:50] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [13:55:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2123 (T321126)', diff saved to https://phabricator.wikimedia.org/P40595 and previous config saved to /var/cache/conftool/dbconfig/20221122-135556-marostegui.json [13:56:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317 (T322618)', diff saved to https://phabricator.wikimedia.org/P40596 and previous config saved to /var/cache/conftool/dbconfig/20221122-135638-ladsgroup.json [13:56:40] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2169.codfw.wmnet with reason: Maintenance [13:56:41] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] icinga: Reject non-tls requests with a 403 [puppet] - 10https://gerrit.wikimedia.org/r/859457 (https://phabricator.wikimedia.org/T238720) (owner: 
10Vgutierrez) [13:56:53] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2169.codfw.wmnet with reason: Maintenance [13:57:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2169:3317 (T322618)', diff saved to https://phabricator.wikimedia.org/P40597 and previous config saved to /var/cache/conftool/dbconfig/20221122-135659-ladsgroup.json [13:57:17] (03PS4) 10Jbond: C:swift::storage: add variable for data directory [puppet] - 10https://gerrit.wikimedia.org/r/848418 (https://phabricator.wikimedia.org/T308677) [13:57:46] !log block plain text requests on icinga.wm.o - T238720 [13:57:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:51] T238720: Deprecate and disable port 80 for one-off sites under canonical domains - https://phabricator.wikimedia.org/T238720 [13:58:10] (03PS3) 10Jbond: C:swift: add swift disks fact [puppet] - 10https://gerrit.wikimedia.org/r/848451 [13:58:18] (03CR) 10CI reject: [V: 04-1] C:swift::storage: add variable for data directory [puppet] - 10https://gerrit.wikimedia.org/r/848418 (https://phabricator.wikimedia.org/T308677) (owner: 10Jbond) [13:58:44] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38379/console" [puppet] - 10https://gerrit.wikimedia.org/r/848418 (https://phabricator.wikimedia.org/T308677) (owner: 10Jbond) [13:59:07] (03CR) 10CI reject: [V: 04-1] C:swift: add swift disks fact [puppet] - 10https://gerrit.wikimedia.org/r/848451 (owner: 10Jbond) [13:59:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317 (T322618)', diff saved to https://phabricator.wikimedia.org/P40598 and previous config saved to /var/cache/conftool/dbconfig/20221122-135917-ladsgroup.json [13:59:22] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [13:59:27] !log marostegui@cumin1001 dbctl 
commit (dc=all): 'Repooling after maintenance db2123 (T321126)', diff saved to https://phabricator.wikimedia.org/P40599 and previous config saved to /var/cache/conftool/dbconfig/20221122-135926-marostegui.json [13:59:54] 10SRE, 10Phabricator, 10Traffic, 10Wikimedia-Incident: Phabricator was logging out users repeatedly (2022-08-26) - https://phabricator.wikimedia.org/T316337 (10Aklapper) >>! In T316337#8216814, @jcrespo wrote: > I am waiting for a 1 paragraph from @Vgutierrez to understand what actually happened to varnish... [13:59:59] (KubernetesAPILatency) firing: (10) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, and awight: Time to snap out of that daydream and deploy UTC afternoon backport window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221122T1400). [14:00:05] No Gerrit patches in the queue for this window AFAICS. 
[14:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221122T1400) [14:00:11] (03PS5) 10Jbond: C:swift::storage: add variable for data directory [puppet] - 10https://gerrit.wikimedia.org/r/848418 (https://phabricator.wikimedia.org/T308677) [14:00:42] (03PS6) 10Jbond: C:swift::storage: add variable for data directory [puppet] - 10https://gerrit.wikimedia.org/r/848418 (https://phabricator.wikimedia.org/T308677) [14:00:58] (03PS4) 10Jbond: C:swift: add swift disks fact [puppet] - 10https://gerrit.wikimedia.org/r/848451 [14:01:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128', diff saved to https://phabricator.wikimedia.org/P40600 and previous config saved to /var/cache/conftool/dbconfig/20221122-140150-marostegui.json [14:01:59] (03CR) 10CI reject: [V: 04-1] C:swift: add swift disks fact [puppet] - 10https://gerrit.wikimedia.org/r/848451 (owner: 10Jbond) [14:03:47] (03CR) 10Filippo Giunchedi: [C: 03+1] "Ship it!" [alerts] - 10https://gerrit.wikimedia.org/r/859502 (https://phabricator.wikimedia.org/T314981) (owner: 10Elukey) [14:05:28] (03CR) 10Jbond: C:swift::storage: add variable for data directory (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/848418 (https://phabricator.wikimedia.org/T308677) (owner: 10Jbond) [14:06:24] !log marostegui@cumin1001 Added views for new wiki: bnwikiquote T319190 [14:06:24] !log marostegui@cumin1001 END (PASS) - Cookbook sre.wikireplicas.add-wiki (exit_code=0) [14:06:29] T319190: Prepare and check storage layer for bnwikiquote - https://phabricator.wikimedia.org/T319190 [14:06:34] 10SRE, 10Phabricator, 10Traffic, 10Wikimedia-Incident: Phabricator was logging out users repeatedly (2022-08-26) - https://phabricator.wikimedia.org/T316337 (10Vgutierrez) > I am going to do it, but I am waiting for a 1 paragraph from @Vgutierrez to understand what actually happened to varnish (not just th... 
[14:08:43] (03PS5) 10Jbond: C:swift: add swift disks fact [puppet] - 10https://gerrit.wikimedia.org/r/848451 (https://phabricator.wikimedia.org/T308677) [14:08:45] (03PS5) 10Jbond: P:swift::storage: add new resource to format via pci path [puppet] - 10https://gerrit.wikimedia.org/r/848419 (https://phabricator.wikimedia.org/T308677) [14:08:47] (03PS5) 10Jbond: ms-be2050: enable disks by path configuerations [puppet] - 10https://gerrit.wikimedia.org/r/848420 (https://phabricator.wikimedia.org/T308677) [14:09:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P40601 and previous config saved to /var/cache/conftool/dbconfig/20221122-140949-ladsgroup.json [14:11:53] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on aux-k8s-etcd1003.eqiad.wmnet with reason: ganeti reboot [14:12:09] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on aux-k8s-etcd1003.eqiad.wmnet with reason: ganeti reboot [14:12:19] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on dse-k8s-etcd1001.eqiad.wmnet with reason: ganeti reboot [14:12:34] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on dse-k8s-etcd1001.eqiad.wmnet with reason: ganeti reboot [14:12:48] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on kubestagetcd1004.eqiad.wmnet with reason: ganeti reboot [14:13:03] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on kubestagetcd1004.eqiad.wmnet with reason: ganeti reboot [14:13:05] 10SRE, 10Phabricator, 10Traffic, 10Wikimedia-Incident: Phabricator was logging out users repeatedly (2022-08-26) - https://phabricator.wikimedia.org/T316337 (10jcrespo) Thanks, that is all I needed to understand the context! I will create a draft doc on Wikitech and link it here for review. 
[14:13:54] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1032.eqiad.wmnet [14:14:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317', diff saved to https://phabricator.wikimedia.org/P40602 and previous config saved to /var/cache/conftool/dbconfig/20221122-141423-ladsgroup.json [14:14:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123', diff saved to https://phabricator.wikimedia.org/P40603 and previous config saved to /var/cache/conftool/dbconfig/20221122-141433-marostegui.json [14:14:47] RECOVERY - Host ganeti1012 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms [14:15:08] 10SRE, 10Traffic, 10Patch-For-Review: Deprecate and disable port 80 for one-off sites under canonical domains - https://phabricator.wikimedia.org/T238720 (10Vgutierrez) [14:16:01] PROBLEM - Check systemd state on ganeti1012 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service,nic-saturation-exporter.service,prometheus-ganeti-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:16:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128', diff saved to https://phabricator.wikimedia.org/P40604 and previous config saved to /var/cache/conftool/dbconfig/20221122-141656-marostegui.json [14:17:45] (JobUnavailable) resolved: Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:18:12] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host dbprov1004.mgmt.eqiad.wmnet with reboot policy FORCED [14:19:45] (03CR) 10Giuseppe Lavagetto: [C: 03+2] api-gateway: convert to modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/856950 (owner: 10Giuseppe Lavagetto) [14:19:57] PROBLEM - Check systemd 
state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:20:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1032.eqiad.wmnet [14:22:39] (NodeTextfileStale) firing: Stale textfile for cloudvirt2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [14:23:23] <_joe_> jouncebot: next [14:23:23] In 2 hour(s) and 36 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221122T1700) [14:23:55] (NodeTextfileStale) firing: (48) Stale textfile for cp1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [14:24:28] (03Merged) 10jenkins-bot: api-gateway: convert to modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/856950 (owner: 10Giuseppe Lavagetto) [14:24:54] (NodeTextfileStale) firing: (48) Stale textfile for cp1076:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [14:24:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P40605 and previous config saved to /var/cache/conftool/dbconfig/20221122-142455-ladsgroup.json [14:24:59] (KubernetesAPILatency) firing: (10) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:26:26] 
Emperor: swift_ring_manager errors in thanos-fe1001 are expected? [14:28:05] !log btullis@cumin1001 START - Cookbook sre.wikireplicas.add-wiki [14:28:11] Emperor: hmm... Nov 22 14:10:33 thanos-fe1001 swift_ring_manager[3724760]: urllib.error.HTTPError: HTTP Error 401: Unauthorized <-- issued by /usr/bin/swift-dispersion-report [14:29:27] 10SRE, 10ops-eqiad, 10DBA: eqiad: move non WMCS servers out of rack C8 - https://phabricator.wikimedia.org/T308339 (10MoritzMuehlenhoff) [14:29:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317', diff saved to https://phabricator.wikimedia.org/P40606 and previous config saved to /var/cache/conftool/dbconfig/20221122-142930-ladsgroup.json [14:29:33] Emperor: so I'm guessing https://thanos-swift.discovery.wmnet/auth/v1.0 is the responsible for that 401 [14:29:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123', diff saved to https://phabricator.wikimedia.org/P40607 and previous config saved to /var/cache/conftool/dbconfig/20221122-142939-marostegui.json [14:32:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128 (T321130)', diff saved to https://phabricator.wikimedia.org/P40608 and previous config saved to /var/cache/conftool/dbconfig/20221122-143203-marostegui.json [14:32:05] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1132.eqiad.wmnet with reason: Maintenance [14:32:09] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [14:32:18] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1132.eqiad.wmnet with reason: Maintenance [14:32:22] !log oblivian@deploy1002 helmfile [staging] START helmfile.d/services/api-gateway: apply [14:32:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1132 (T321130)', diff saved to https://phabricator.wikimedia.org/P40609 and previous config 
saved to /var/cache/conftool/dbconfig/20221122-143224-marostegui.json [14:32:41] (03CR) 10Jbond: ms-be2050: enable disks by path configuerations (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/848420 (https://phabricator.wikimedia.org/T308677) (owner: 10Jbond) [14:33:03] RECOVERY - Check systemd state on ganeti1012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:33:10] !log oblivian@deploy1002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply [14:33:18] (03PS6) 10Jbond: P:swift::storage: add new resource to format via pci path [puppet] - 10https://gerrit.wikimedia.org/r/848419 (https://phabricator.wikimedia.org/T308677) [14:33:20] (03PS6) 10Jbond: ms-be2050: enable disks by path configuerations [puppet] - 10https://gerrit.wikimedia.org/r/848420 (https://phabricator.wikimedia.org/T308677) [14:33:43] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Introduce the ClusterConfig class [mediawiki-config] - 10https://gerrit.wikimedia.org/r/749717 (owner: 10Giuseppe Lavagetto) [14:33:49] (03PS6) 10Jbond: C:swift: add swift disks fact [puppet] - 10https://gerrit.wikimedia.org/r/848451 (https://phabricator.wikimedia.org/T308677) [14:33:59] (03PS7) 10Jbond: P:swift::storage: add new resource to format via pci path [puppet] - 10https://gerrit.wikimedia.org/r/848419 (https://phabricator.wikimedia.org/T308677) [14:34:07] (03PS7) 10Jbond: ms-be2050: enable disks by path configuerations [puppet] - 10https://gerrit.wikimedia.org/r/848420 (https://phabricator.wikimedia.org/T308677) [14:34:13] (03PS2) 10Ssingh: lvs4009: commission new LVS host (ulsfo hardware refresh) [puppet] - 10https://gerrit.wikimedia.org/r/858336 (https://phabricator.wikimedia.org/T317247) [14:34:33] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/api-gateway: apply [14:34:58] 10SRE, 10Phabricator, 10Traffic, 10Wikimedia-Incident: Phabricator was logging out users repeatedly (2022-08-26) - 
https://phabricator.wikimedia.org/T316337 (10jcrespo) I am filling in: https://wikitech.wikimedia.org/wiki/Incidents/2022-08-26_Phabricator_login_issues (Still WIP) [14:35:20] (03Merged) 10jenkins-bot: Introduce the ClusterConfig class [mediawiki-config] - 10https://gerrit.wikimedia.org/r/749717 (owner: 10Giuseppe Lavagetto) [14:35:22] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38380/console" [puppet] - 10https://gerrit.wikimedia.org/r/848419 (https://phabricator.wikimedia.org/T308677) (owner: 10Jbond) [14:35:37] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/api-gateway: apply [14:36:00] (03PS4) 10Vgutierrez: node: Exclude trafficserver promfile mtime check [alerts] - 10https://gerrit.wikimedia.org/r/858658 (https://phabricator.wikimedia.org/T292815) (owner: 10BCornwall) [14:36:26] godog: ^^ merging https://gerrit.wikimedia.org/r/858658 is enough to get it deployed? [14:37:05] vgutierrez: correct yeah, will be deployed at the next puppet run [14:37:14] ack [14:37:22] merging it.. 
we got some noise already [14:37:54] (as soon as jenkins-bot is happy with the current PS) [14:38:17] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1122.eqiad.wmnet with reason: Maintenance [14:38:20] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1122.eqiad.wmnet with reason: Maintenance [14:38:47] *nod* [14:38:54] (03CR) 10Jbond: [C: 03+2] C:swift::storage: add variable for data directory [puppet] - 10https://gerrit.wikimedia.org/r/848418 (https://phabricator.wikimedia.org/T308677) (owner: 10Jbond) [14:38:57] (03CR) 10Jbond: [C: 03+2] C:swift: add swift disks fact [puppet] - 10https://gerrit.wikimedia.org/r/848451 (https://phabricator.wikimedia.org/T308677) (owner: 10Jbond) [14:39:00] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1122.eqiad.wmnet with reason: Maintenance [14:39:01] (03CR) 10Ssingh: [C: 03+2] lvs4009: commission new LVS host (ulsfo hardware refresh) [puppet] - 10https://gerrit.wikimedia.org/r/858336 (https://phabricator.wikimedia.org/T317247) (owner: 10Ssingh) [14:39:02] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1122.eqiad.wmnet with reason: Maintenance [14:39:04] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:swift::storage: add new resource to format via pci path [puppet] - 10https://gerrit.wikimedia.org/r/848419 (https://phabricator.wikimedia.org/T308677) (owner: 10Jbond) [14:39:23] jbond: ok to merge your changes? 
:) [14:39:27] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/api-gateway: apply [14:39:43] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1122.eqiad.wmnet with reason: Maintenance [14:39:45] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1122.eqiad.wmnet with reason: Maintenance [14:40:02] sukhe: yes please [14:40:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T322618)', diff saved to https://phabricator.wikimedia.org/P40610 and previous config saved to /var/cache/conftool/dbconfig/20221122-144002-ladsgroup.json [14:40:04] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1194.eqiad.wmnet with reason: Maintenance [14:40:05] done! [14:40:08] thanks [14:40:08] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [14:40:17] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1194.eqiad.wmnet with reason: Maintenance [14:40:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1194 (T322618)', diff saved to https://phabricator.wikimedia.org/P40611 and previous config saved to /var/cache/conftool/dbconfig/20221122-144023-ladsgroup.json [14:40:28] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1122.eqiad.wmnet with reason: Maintenance [14:40:31] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1122.eqiad.wmnet with reason: Maintenance [14:40:42] (03CR) 10CI reject: [V: 04-1] node: Exclude trafficserver promfile mtime check [alerts] - 10https://gerrit.wikimedia.org/r/858658 (https://phabricator.wikimedia.org/T292815) (owner: 10BCornwall) [14:41:13] wonderful [14:41:18] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1122.eqiad.wmnet with reason: Maintenance [14:41:21] !log 
ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1122.eqiad.wmnet with reason: Maintenance [14:41:29] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: apply [14:41:51] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host lvs4009.ulsfo.wmnet with OS buster [14:42:11] node.yaml: 5:15: group "node_exporter", rule 1, "NodeTextfileStale": could not parse expression: 1:44: parse error: unknown escape sequence U+002E '.' [14:42:15] sigh [14:42:30] (03CR) 10Cathal Mooney: [V: 03+2 C: 03+2] Release v0.6.1 update [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/859491 (https://phabricator.wikimedia.org/T312635) (owner: 10Cathal Mooney) [14:42:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T322618)', diff saved to https://phabricator.wikimedia.org/P40612 and previous config saved to /var/cache/conftool/dbconfig/20221122-144232-ladsgroup.json [14:43:39] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: apply config changes - bking@cumin2002 - T319020 [14:43:44] T319020: Reset to upstream java GC options and remove redundant JVM options - https://phabricator.wikimedia.org/T319020 [14:43:46] (03PS5) 10Vgutierrez: node: Exclude trafficserver promfile mtime check [alerts] - 10https://gerrit.wikimedia.org/r/858658 (https://phabricator.wikimedia.org/T292815) (owner: 10BCornwall) [14:44:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317 (T322618)', diff saved to https://phabricator.wikimedia.org/P40613 and previous config saved to /var/cache/conftool/dbconfig/20221122-144436-ladsgroup.json [14:44:38] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2182.codfw.wmnet with reason: Maintenance [14:44:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance 
db2123 (T321126)', diff saved to https://phabricator.wikimedia.org/P40614 and previous config saved to /var/cache/conftool/dbconfig/20221122-144446-marostegui.json [14:44:49] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2128.codfw.wmnet with reason: Maintenance [14:44:51] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2128.codfw.wmnet with reason: Maintenance [14:44:52] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2182.codfw.wmnet with reason: Maintenance [14:44:52] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [14:44:53] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db2094.codfw.wmnet with reason: Maintenance [14:44:55] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db2094.codfw.wmnet with reason: Maintenance [14:44:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2182 (T322618)', diff saved to https://phabricator.wikimedia.org/P40615 and previous config saved to /var/cache/conftool/dbconfig/20221122-144458-ladsgroup.json [14:44:59] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:45:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2128 (T321126)', diff saved to https://phabricator.wikimedia.org/P40616 and previous config saved to /var/cache/conftool/dbconfig/20221122-144507-marostegui.json [14:45:17] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [14:45:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132 (T321130)', diff saved to https://phabricator.wikimedia.org/P40617 and 
previous config saved to /var/cache/conftool/dbconfig/20221122-144519-marostegui.json [14:45:20] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dbprov1004.mgmt.eqiad.wmnet with reboot policy FORCED [14:45:27] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [14:45:29] (03CR) 10CI reject: [V: 04-1] node: Exclude trafficserver promfile mtime check [alerts] - 10https://gerrit.wikimedia.org/r/858658 (https://phabricator.wikimedia.org/T292815) (owner: 10BCornwall) [14:45:36] !log cmooney@cumin1001 START - Cookbook sre.deploy.python-code homer to cumin2002.codfw.wmnet,cumin1001.eqiad.wmnet with reason: Release v0.6.1 - cmooney@cumin1001 [14:45:40] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host dbprov1004.mgmt.eqiad.wmnet with reboot policy FORCED [14:47:14] !log cmooney@cumin1001 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) homer to cumin2002.codfw.wmnet,cumin1001.eqiad.wmnet with reason: Release v0.6.1 - cmooney@cumin1001 [14:47:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T322618)', diff saved to https://phabricator.wikimedia.org/P40618 and previous config saved to /var/cache/conftool/dbconfig/20221122-144715-ladsgroup.json [14:48:13] PROBLEM - Check systemd state on dbstore1007 is CRITICAL: CRITICAL - degraded: The following units failed: user@0.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:48:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128 (T321126)', diff saved to https://phabricator.wikimedia.org/P40619 and previous config saved to /var/cache/conftool/dbconfig/20221122-144833-marostegui.json [14:48:37] vgutierrez: the occasional failure isn't the end of the world (it runs hourly); those auth failures are related to the frontends being loaded; I'm starting to wonder if we should think about more capacity there as well as 
ms- [14:48:59] !log jnuche@deploy1002 Started scap: testing k8s deploys [14:51:38] Emperor: ack [14:53:24] !log btullis@cumin1001 Added views for new wiki: tlwikiquote T317111 [14:53:24] !log btullis@cumin1001 END (PASS) - Cookbook sre.wikireplicas.add-wiki (exit_code=0) [14:53:30] T317111: Prepare and check storage layer for tlwikiquote - https://phabricator.wikimedia.org/T317111 [14:53:33] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:54:11] (03CR) 10Ilias Sarantopoulos: ml-services: Update docker images to use single model server (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/859461 (https://phabricator.wikimedia.org/T320374) (owner: 10Ilias Sarantopoulos) [14:54:59] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:55:06] (03PS2) 10Ilias Sarantopoulos: ml-services: Update docker images to use single model server [deployment-charts] - 10https://gerrit.wikimedia.org/r/859461 (https://phabricator.wikimedia.org/T320374) [14:55:07] !log jnuche@deploy1002 Finished scap: testing k8s deploys (duration: 06m 08s) [14:55:22] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1122.eqiad.wmnet with reason: Maintenance [14:55:25] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1122.eqiad.wmnet with reason: Maintenance [14:55:49] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [14:56:41] !log oblivian@deploy1002 Started scap: Adding clusterconfig [14:57:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to 
https://phabricator.wikimedia.org/P40620 and previous config saved to /var/cache/conftool/dbconfig/20221122-145738-ladsgroup.json [14:57:47] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [14:58:34] godog: node.yaml: 5:15: group "node_exporter", rule 1, "NodeTextfileStale": could not parse expression: 1:44: parse error: unknown escape sequence U+002E '.' --> any idea on how to properly escape a dot (.) in a regex on the alerts repo? [15:00:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132', diff saved to https://phabricator.wikimedia.org/P40621 and previous config saved to /var/cache/conftool/dbconfig/20221122-150025-marostegui.json [15:00:58] !log oblivian@deploy1002 Finished scap: Adding clusterconfig (duration: 04m 17s) [15:02:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P40622 and previous config saved to /var/cache/conftool/dbconfig/20221122-150221-ladsgroup.json [15:03:12] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs4009.ulsfo.wmnet with reason: host reimage [15:03:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128', diff saved to https://phabricator.wikimedia.org/P40623 and previous config saved to /var/cache/conftool/dbconfig/20221122-150339-marostegui.json [15:06:32] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [15:06:50] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs4009.ulsfo.wmnet with reason: host reimage [15:07:28] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [15:07:32] RECOVERY - SSH on db1122.mgmt is OK: 
SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:09:59] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:11:21] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dbprov1004.mgmt.eqiad.wmnet with reboot policy FORCED [15:11:39] (03PS1) 10Jforrester: [Beta Cluster] Point Wikifunctions mobile links to the right place [mediawiki-config] - 10https://gerrit.wikimedia.org/r/859539 (https://phabricator.wikimedia.org/T314891) [15:12:26] jouncebot: now [15:12:26] No deployments scheduled for the next 1 hour(s) and 47 minute(s) [15:12:34] 'K, will sling out a Beta-only patch. [15:12:39] (NodeTextfileStale) resolved: Stale textfile for cloudvirt2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [15:12:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P40624 and previous config saved to /var/cache/conftool/dbconfig/20221122-151245-ladsgroup.json [15:13:03] (03CR) 10Jforrester: [C: 03+2] [Beta Cluster] Point Wikifunctions mobile links to the right place [mediawiki-config] - 10https://gerrit.wikimedia.org/r/859539 (https://phabricator.wikimedia.org/T314891) (owner: 10Jforrester) [15:13:32] !log pt1979@cumin1001 START - Cookbook sre.hosts.provision for host dbprov1004.mgmt.eqiad.wmnet with reboot policy FORCED [15:13:56] (03Merged) 10jenkins-bot: [Beta Cluster] Point Wikifunctions mobile links to the right place [mediawiki-config] - 10https://gerrit.wikimedia.org/r/859539 (https://phabricator.wikimedia.org/T314891) (owner: 10Jforrester) [15:14:00] 
PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [15:14:42] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:15:04] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [15:15:07] (03PS2) 10Giuseppe Lavagetto: image-suggestion: convert to modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/859485 [15:15:09] (03PS1) 10Giuseppe Lavagetto: Fixes for conversions [deployment-charts] - 10https://gerrit.wikimedia.org/r/859540 [15:15:11] (03PS1) 10Giuseppe Lavagetto: remove unused chart/project image-suggestion-api [deployment-charts] - 10https://gerrit.wikimedia.org/r/859541 [15:15:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132', diff saved to https://phabricator.wikimedia.org/P40625 and previous config saved to /var/cache/conftool/dbconfig/20221122-151532-marostegui.json [15:15:39] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Fixes for conversions [deployment-charts] - 10https://gerrit.wikimedia.org/r/859540 (owner: 10Giuseppe Lavagetto) [15:15:48] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs1012.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1006.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1007.eqiad.wmnet, wdqs1005.eqiad.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs1013.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1006.eqiad.wmnet, wdqs1004.eqiad.wmnet, wdqs10 [15:15:48] .wmnet, wdqs1016.eqiad.wmnet, wdqs1007.eqiad.wmnet, wdqs1005.eqiad.wmnet are marked down but pooled: wdqs_80: Servers wdqs1007.eqiad.wmnet, wdqs1015.eqiad.wmnet, 
wdqs1006.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1005.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:16:04] (03CR) 10CI reject: [V: 04-1] image-suggestion: convert to modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/859485 (owner: 10Giuseppe Lavagetto) [15:16:08] (03CR) 10CI reject: [V: 04-1] remove unused chart/project image-suggestion-api [deployment-charts] - 10https://gerrit.wikimedia.org/r/859541 (owner: 10Giuseppe Lavagetto) [15:16:18] (ProbeDown) firing: Service wdqs-ssl:443 has failed probes (http_wdqs-ssl_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs-ssl:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:16:21] ryankemper / inflatador ^^ [15:16:28] gehel :eyes [15:16:30] PROBLEM - WDQS SPARQL on wdqs1016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [15:16:30] PROBLEM - WDQS SPARQL on wdqs1015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [15:16:52] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs1015.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1004.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1007.eqiad.wmnet, wdqs1005.eqiad.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs1015.eqiad.wmnet, wdqs1005.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1007.eqiad.wmnet, wdqs10 [15:16:52] .wmnet are marked down but pooled: wdqs_80: Servers wdqs1004.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1006.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1007.eqiad.wmnet, wdqs1005.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal 
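Note on the escaping question asked at 14:58:34 (it gets no reply in this part of the log): the parse error is consistent with PromQL double-quoted strings following Go-style escaping, where a lone `\.` is not a valid escape sequence. The backslash therefore has to be doubled inside the expression. A minimal sketch of the two escaping layers, assuming a hypothetical label `file` and a hypothetical file name `trafficserver.prom` (neither is taken from the actual alert rule):

```python
import re


def promql_regex_literal(text: str) -> str:
    """Escape `text` so it matches literally inside a double-quoted PromQL regex matcher."""
    # Layer 1: regex-escape, so "." becomes "\." and matches a literal dot.
    regex = re.escape(text)
    # Layer 2: PromQL double-quoted strings treat backslash as an escape
    # character, so every backslash has to be doubled.
    return regex.replace("\\", "\\\\")


# Hypothetical label and file name, for illustration only.
matcher = 'file=~"' + promql_regex_literal("trafficserver.prom") + '"'
print(matcher)
```

If the expression is itself placed in a double-quoted YAML scalar, YAML adds a further escaping layer; a plain or block scalar avoids that.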
[15:16:57] inflatador: are there any ongoing work on wdqs / eqiad? [15:17:04] No [15:17:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P40626 and previous config saved to /var/cache/conftool/dbconfig/20221122-151728-ladsgroup.json [15:17:44] PROBLEM - WDQS SPARQL on wdqs1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [15:17:45] surge in load, thread counts exploded [15:18:00] (03CR) 10Stang: "To deployer: this patch requires a maint script run, please read T323378#8413476" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/858705 (https://phabricator.wikimedia.org/T323378) (owner: 10Stang) [15:18:08] 10SRE, 10Machine-Learning-Team, 10Service-deployment-requests, 10artificial-intelligence: New Service Request 'open_nsfw' - https://phabricator.wikimedia.org/T250110 (10calbon) a:03calbon [15:18:18] (ProbeDown) firing: Service wdqs-ssl:443 has failed probes (http_wdqs-ssl_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#wdqs-ssl:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:18:35] * akosiaris around [15:18:37] acking page [15:18:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128', diff saved to https://phabricator.wikimedia.org/P40627 and previous config saved to /var/cache/conftool/dbconfig/20221122-151846-marostegui.json [15:18:48] also around, thanks alex [15:18:51] akosiaris is that page for wdqs? [15:18:56] PROBLEM - WDQS SPARQL on wdqs1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 429 Too Many Requests - string http://www.w3.org/2001/XML... 
not found on https://query.wikidata.org:443/bigdata/namespace/wdq/sparql?query=SELECT%20*%20WHERE%20%7Bwikibase%3ADump%20schema%3AdateModified%20%3Fy%7D%20LIMIT%201 - 754 bytes in 1.049 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [15:18:57] inflatador: yes [15:19:03] so probably related to specific queries? a bot abusing the service? [15:19:14] akosiaris we are looking into it now if that helps [15:19:19] around as well [15:19:28] inflatador: cool, thanks for letting us know [15:19:47] gehel: I see a 429 Too Many requests alert, so you are probably right [15:19:51] yes I suspect a bot with a bad query, let's try to restart all the nodes in eqiad [15:19:59] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:20:00] (03Merged) 10jenkins-bot: Fixes for conversions [deployment-charts] - 10https://gerrit.wikimedia.org/r/859540 (owner: 10Giuseppe Lavagetto) [15:20:12] dcausse cool, will get started on that immediately [15:20:16] WDQS is known to be somewhat unstable, and we're not shooting for a 99% availability. So no need to have everyone on deck at the moment [15:20:18] RECOVERY - WDQS SPARQL on wdqs1012 is OK: HTTP OK: HTTP/1.1 200 OK - 688 bytes in 1.051 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [15:20:20] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [15:20:20] RECOVERY - WDQS SPARQL on wdqs1015 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.121 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [15:20:37] gehel: cool, thanks for that info [15:21:05] inflatador: I'm assuming that you're on it and you'll scream for help as needed? 
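Note on the 15:18:56 WDQS SPARQL alert: the check URL carries a percent-encoded SPARQL query, and the check reported HTTP 429 for it. A minimal sketch for decoding that URL with the Python standard library; the URL is copied from the alert above, and the function name is only illustrative:

```python
from urllib.parse import parse_qs, urlsplit

# URL exactly as it appears in the 15:18:56 alert.
CHECK_URL = (
    "https://query.wikidata.org:443/bigdata/namespace/wdq/sparql"
    "?query=SELECT%20*%20WHERE%20%7Bwikibase%3ADump%20"
    "schema%3AdateModified%20%3Fy%7D%20LIMIT%201"
)


def sparql_from_check_url(url: str) -> str:
    """Return the decoded value of the `query` parameter of a check URL."""
    query_string = urlsplit(url).query
    # parse_qs percent-decodes values and returns a list per parameter.
    return parse_qs(query_string)["query"][0]


print(sparql_from_check_url(CHECK_URL))
```

The decoded query only reads the dump's dateModified value with LIMIT 1, so a 429 on it suggests throttling of requests in general, not an expensive probe.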
[15:21:18] (ProbeDown) resolved: Service wdqs-ssl:443 has failed probes (http_wdqs-ssl_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs-ssl:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:21:26] do you need an IC? [15:21:42] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [15:21:43] or does it seem simple enough with no need for more coordination? [15:22:15] !log bking@cumin2002 START - Cookbook sre.wdqs.restart [15:22:21] (03PS1) 10Jbond: wmcs - sso: update ogin url to idp-dev [puppet] - 10https://gerrit.wikimedia.org/r/859542 [15:22:37] akosiaris: Let's see if it recovers after a restart of the services. If that's not the case, it's going to be more problematic and might need an IC. [15:22:48] akosiaris ^^ what gehel said [15:23:12] RECOVERY - WDQS SPARQL on wdqs1016 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.073 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [15:23:14] cool [15:23:18] (ProbeDown) resolved: Service wdqs-ssl:443 has failed probes (http_wdqs-ssl_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#wdqs-ssl:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:23:39] ok, I'll stand by and watch what happens after restart. 
The page recovered fyi [15:23:54] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:24:03] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:24:44] RECOVERY - WDQS SPARQL on wdqs1005 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.079 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [15:25:03] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs4009.ulsfo.wmnet with OS buster [15:26:03] (03PS1) 10JMeybohm: pontoon: Add .crt filename suffix to PKI root CA [puppet] - 10https://gerrit.wikimedia.org/r/859543 (https://phabricator.wikimedia.org/T319163) [15:27:40] !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.restart (exit_code=0) [15:27:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T322618)', diff saved to https://phabricator.wikimedia.org/P40628 and previous config saved to /var/cache/conftool/dbconfig/20221122-152751-ladsgroup.json [15:27:53] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1202.eqiad.wmnet with reason: Maintenance [15:27:58] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [15:28:07] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1202.eqiad.wmnet with reason: Maintenance [15:28:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1202 (T322618)', diff saved to https://phabricator.wikimedia.org/P40629 and previous config saved to /var/cache/conftool/dbconfig/20221122-152813-ladsgroup.json [15:30:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T322618)', diff saved to https://phabricator.wikimedia.org/P40630 and previous config saved to 
/var/cache/conftool/dbconfig/20221122-153023-ladsgroup.json [15:30:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132 (T321130)', diff saved to https://phabricator.wikimedia.org/P40631 and previous config saved to /var/cache/conftool/dbconfig/20221122-153038-marostegui.json [15:30:40] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1133.eqiad.wmnet with reason: Maintenance [15:30:44] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [15:30:54] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1133.eqiad.wmnet with reason: Maintenance [15:31:13] (03PS1) 10Kosta Harlan: GrowthExperiments: Allow accessing NewImpact module in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/859545 (https://phabricator.wikimedia.org/T323526) [15:31:13] !log bking@cumin2002 START - Cookbook sre.wdqs.restart [15:31:27] (03CR) 10Cathal Mooney: [C: 03+2] Unify routing-intstance config across JunOS devices [homer/public] - 10https://gerrit.wikimedia.org/r/857598 (https://phabricator.wikimedia.org/T312635) (owner: 10Cathal Mooney) [15:32:01] (03Merged) 10jenkins-bot: Unify routing-intstance config across JunOS devices [homer/public] - 10https://gerrit.wikimedia.org/r/857598 (https://phabricator.wikimedia.org/T312635) (owner: 10Cathal Mooney) [15:32:27] (03CR) 10Filippo Giunchedi: "Doh! 
thank you, LGTM (please see inline too)" [puppet] - 10https://gerrit.wikimedia.org/r/859543 (https://phabricator.wikimedia.org/T319163) (owner: 10JMeybohm) [15:32:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T322618)', diff saved to https://phabricator.wikimedia.org/P40632 and previous config saved to /var/cache/conftool/dbconfig/20221122-153235-ladsgroup.json [15:33:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128 (T321126)', diff saved to https://phabricator.wikimedia.org/P40633 and previous config saved to /var/cache/conftool/dbconfig/20221122-153352-marostegui.json [15:33:54] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2137.codfw.wmnet with reason: Maintenance [15:33:57] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2137.codfw.wmnet with reason: Maintenance [15:33:58] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [15:34:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2137:3315 (T321126)', diff saved to https://phabricator.wikimedia.org/P40634 and previous config saved to /var/cache/conftool/dbconfig/20221122-153403-marostegui.json [15:34:07] (03PS2) 10JMeybohm: pontoon: Add .crt filename suffix to PKI root CA [puppet] - 10https://gerrit.wikimedia.org/r/859543 (https://phabricator.wikimedia.org/T319163) [15:34:42] (03CR) 10JMeybohm: pontoon: Add .crt filename suffix to PKI root CA (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/859543 (https://phabricator.wikimedia.org/T319163) (owner: 10JMeybohm) [15:34:58] !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.restart (exit_code=0) [15:34:59] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:36:51] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/859543 (https://phabricator.wikimedia.org/T319163) (owner: 10JMeybohm) [15:37:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315 (T321126)', diff saved to https://phabricator.wikimedia.org/P40635 and previous config saved to /var/cache/conftool/dbconfig/20221122-153727-marostegui.json [15:37:34] !log upgrading mwdebug2002 to 7.4.33-1+0~20221108.73+debian10~1.gbpa00350a+wmf10u1 [15:37:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:39] !log upgrading mwdebug2002 to PHP 7.4.33-1+0~20221108.73+debian10~1.gbpa00350a+wmf10u1 [15:37:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:48] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:38:02] (03CR) 10Jbond: [C: 03+2] wmcs - sso: update ogin url to idp-dev [puppet] - 10https://gerrit.wikimedia.org/r/859542 (owner: 10Jbond) [15:39:01] !log updating route-distinguisher for cloud vrf on cloud switches eqiad [15:39:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:57] (03PS1) 10Filippo Giunchedi: prometheus: move webperf jobs to 'ext' instance [puppet] - 10https://gerrit.wikimedia.org/r/859566 (https://phabricator.wikimedia.org/T175087) [15:41:08] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1134.eqiad.wmnet with reason: Maintenance [15:41:21] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1134.eqiad.wmnet with reason: Maintenance [15:41:22] (03PS2) 10Kosta Harlan: GrowthExperiments: Allow accessing NewImpact module in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/859545 
(https://phabricator.wikimedia.org/T323526) [15:41:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1134 (T321130)', diff saved to https://phabricator.wikimedia.org/P40636 and previous config saved to /var/cache/conftool/dbconfig/20221122-154127-marostegui.json [15:41:33] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [15:43:22] !log importing php7.4 7.4.33-1+0~20221108.73+debian10~1.gbpa00350a+wmf10u1 to apt.wikimedia.org T323358 [15:43:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P40637 and previous config saved to /var/cache/conftool/dbconfig/20221122-154530-ladsgroup.json [15:45:32] (03PS3) 10Cathal Mooney: Add section for PIC config of QFX5120-48Y port block speeds [homer/public] - 10https://gerrit.wikimedia.org/r/840105 (https://phabricator.wikimedia.org/T303529) [15:45:34] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38381/console" [puppet] - 10https://gerrit.wikimedia.org/r/859566 (https://phabricator.wikimedia.org/T175087) (owner: 10Filippo Giunchedi) [15:45:37] (03CR) 10Cathal Mooney: [C: 03+2] Add section for PIC config of QFX5120-48Y port block speeds [homer/public] - 10https://gerrit.wikimedia.org/r/840105 (https://phabricator.wikimedia.org/T303529) (owner: 10Cathal Mooney) [15:46:09] (03Merged) 10jenkins-bot: Add section for PIC config of QFX5120-48Y port block speeds [homer/public] - 10https://gerrit.wikimedia.org/r/840105 (https://phabricator.wikimedia.org/T303529) (owner: 10Cathal Mooney) [15:49:46] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: 
Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:50:11] (03CR) 10Cwhite: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/859566 (https://phabricator.wikimedia.org/T175087) (owner: 10Filippo Giunchedi) [15:50:32] (03CR) 10Cwhite: [C: 03+1] hiera: add graphite2004 to codfw graphite queries [puppet] - 10https://gerrit.wikimedia.org/r/858611 (https://phabricator.wikimedia.org/T315524) (owner: 10Filippo Giunchedi) [15:50:37] (03PS6) 10Vgutierrez: node: Exclude trafficserver promfile mtime check [alerts] - 10https://gerrit.wikimedia.org/r/858658 (https://phabricator.wikimedia.org/T292815) (owner: 10BCornwall) [15:50:43] (03PS2) 10Giuseppe Lavagetto: remove unused chart/project image-suggestion-api [deployment-charts] - 10https://gerrit.wikimedia.org/r/859541 [15:50:45] (03PS3) 10Giuseppe Lavagetto: image-suggestion: convert to modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/859485 [15:50:47] (03PS1) 10Giuseppe Lavagetto: Add conversion for ingress [deployment-charts] - 10https://gerrit.wikimedia.org/r/859567 [15:51:22] (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] prometheus: move webperf jobs to 'ext' instance [puppet] - 10https://gerrit.wikimedia.org/r/859566 (https://phabricator.wikimedia.org/T175087) (owner: 10Filippo Giunchedi) [15:51:30] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:51:51] (03PS1) 10Kosta Harlan: GrowthExperiments: Run refreshUserImpactData maintenance script in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/859568 (https://phabricator.wikimedia.org/T322541) [15:52:15] (03PS2) 10Kosta Harlan: GrowthExperiments: Run refreshUserImpactData maintenance script in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/859568 
(https://phabricator.wikimedia.org/T322541) [15:52:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315', diff saved to https://phabricator.wikimedia.org/P40638 and previous config saved to /var/cache/conftool/dbconfig/20221122-155234-marostegui.json [15:52:39] (03CR) 10Kosta Harlan: [C: 04-2] "Don't deploy until Id6eac58bd0ab36c02136486114010739bccc1ba1 is in group2" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/859568 (https://phabricator.wikimedia.org/T322541) (owner: 10Kosta Harlan) [15:54:25] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, 10Release-Engineering-Team (Seen): Make mw-web and mw-api-ext available behind LVS - https://phabricator.wikimedia.org/T323621 (10Clement_Goubert) [15:55:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T321130)', diff saved to https://phabricator.wikimedia.org/P40639 and previous config saved to /var/cache/conftool/dbconfig/20221122-155523-marostegui.json [15:55:29] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [15:55:54] !log cgoubert@cumin1001 START - Cookbook sre.dns.netbox [15:57:12] !log T323621 Add IPs for mw-web.svc and mw-api-ext.svc [15:57:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:18] T323621: Make mw-web and mw-api-ext available behind LVS - https://phabricator.wikimedia.org/T323621 [15:58:37] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1122.eqiad.wmnet with reason: Maintenance [15:58:39] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1122.eqiad.wmnet with reason: Maintenance [15:58:41] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM!" 
[homer/public] - 10https://gerrit.wikimedia.org/r/859065 (https://phabricator.wikimedia.org/T317247) (owner: 10Ssingh) [15:59:07] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:00:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P40640 and previous config saved to /var/cache/conftool/dbconfig/20221122-160036-ladsgroup.json [16:00:45] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, 10Release-Engineering-Team (Seen): Make mw-web and mw-api-ext available behind LVS - https://phabricator.wikimedia.org/T323621 (10Clement_Goubert) 05Open→03In progress [16:01:01] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 3 others: Deploy mediawiki kubernetes services - https://phabricator.wikimedia.org/T321786 (10Clement_Goubert) [16:02:12] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:02:18] !log drain ganeti1027 for eventual reimage to Bullseye T311687 [16:02:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:23] T311687: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 [16:03:53] (03CR) 10Hnowlan: [C: 03+1] sessionstore: bump container version to v1.0.10 [deployment-charts] - 10https://gerrit.wikimedia.org/r/857711 (https://phabricator.wikimedia.org/T253244) (owner: 10Eevans) [16:04:12] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:04:19] (03CR) 10Eevans: [C: 03+2] sessionstore: bump container version to v1.0.10 [deployment-charts] - 10https://gerrit.wikimedia.org/r/857711 (https://phabricator.wikimedia.org/T253244) (owner: 10Eevans) [16:07:41] !log 
marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315', diff saved to https://phabricator.wikimedia.org/P40641 and previous config saved to /var/cache/conftool/dbconfig/20221122-160740-marostegui.json [16:08:38] (03Merged) 10jenkins-bot: sessionstore: bump container version to v1.0.10 [deployment-charts] - 10https://gerrit.wikimedia.org/r/857711 (https://phabricator.wikimedia.org/T253244) (owner: 10Eevans) [16:09:59] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:10:05] (03PS1) 10Clément Goubert: wmnet: Add mw-web, mw-api-ext [dns] - 10https://gerrit.wikimedia.org/r/859571 (https://phabricator.wikimedia.org/T323621) [16:10:15] (03CR) 10Elukey: Add a spark-operator chart and helmfile configuraiton (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/855674 (https://phabricator.wikimedia.org/T318926) (owner: 10Btullis) [16:10:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P40642 and previous config saved to /var/cache/conftool/dbconfig/20221122-161029-marostegui.json [16:10:48] !log eevans@deploy1002 helmfile [staging] START helmfile.d/services/sessionstore: apply [16:10:52] PROBLEM - Ganeti memory on ganeti1011 is CRITICAL: CRIT Memory 95% used. 
Largest process: qemu-system-x86 (30718) = 25.4% https://wikitech.wikimedia.org/wiki/Ganeti%23Memory_pressure [16:10:59] (03PS2) 10Bernard Wang: Update TOC to use PinnableHeader [skins/Vector] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/859076 (https://phabricator.wikimedia.org/T317897) (owner: 10Jdlrobson) [16:11:21] !log eevans@deploy1002 helmfile [staging] DONE helmfile.d/services/sessionstore: apply [16:11:38] (03PS3) 10Bernard Wang: Update TOC to use PinnableHeader [skins/Vector] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/859076 (https://phabricator.wikimedia.org/T317897) (owner: 10Jdlrobson) [16:12:51] (03PS1) 10Bernard Wang: Fix icon button spacing in sticky header [skins/Vector] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/859508 (https://phabricator.wikimedia.org/T323176) [16:15:42] (03PS1) 10Clément Goubert: service::catalog: Add mw-web and mw-api-ext [puppet] - 10https://gerrit.wikimedia.org/r/859572 (https://phabricator.wikimedia.org/T323621) [16:15:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T322618)', diff saved to https://phabricator.wikimedia.org/P40643 and previous config saved to /var/cache/conftool/dbconfig/20221122-161542-ladsgroup.json [16:15:44] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [16:15:47] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [16:15:49] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [16:16:30] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:19:18] 10Puppet, 10Infrastructure-Foundations: Consider alternative configuration managment tooling - 
https://phabricator.wikimedia.org/T321874 (10Joe) >>! In T321874#8405699, @bking wrote: > >> I don't think there is a productive and actionable outcome of the discussion in this task, nor that we've made progress in... [16:19:26] (03PS8) 10Jbond: ms-be2050: enable disks by path configuerations [puppet] - 10https://gerrit.wikimedia.org/r/848420 (https://phabricator.wikimedia.org/T308677) [16:19:30] (03CR) 10Btullis: Add a spark-operator chart and helmfile configuraiton (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/855674 (https://phabricator.wikimedia.org/T318926) (owner: 10Btullis) [16:19:59] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (PATCH events) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:21:23] (03CR) 10Jbond: [C: 03+2] ms-be2050: enable disks by path configuerations [puppet] - 10https://gerrit.wikimedia.org/r/848420 (https://phabricator.wikimedia.org/T308677) (owner: 10Jbond) [16:22:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315 (T321126)', diff saved to https://phabricator.wikimedia.org/P40644 and previous config saved to /var/cache/conftool/dbconfig/20221122-162247-marostegui.json [16:22:49] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2157.codfw.wmnet with reason: Maintenance [16:22:51] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2157.codfw.wmnet with reason: Maintenance [16:22:53] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [16:22:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2157 (T321126)', diff saved to https://phabricator.wikimedia.org/P40645 and previous config saved to /var/cache/conftool/dbconfig/20221122-162257-marostegui.json [16:24:33] (03PS2) 10Clément Goubert: service::catalog: Add 
mw-web and mw-api-ext [puppet] - 10https://gerrit.wikimedia.org/r/859572 (https://phabricator.wikimedia.org/T323621) [16:24:59] (KubernetesAPILatency) firing: (4) High Kubernetes API latency (PATCH events) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:25:34] (03PS1) 10Jbond: P:swift::configure_disks: remove ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/859573 [16:25:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P40646 and previous config saved to /var/cache/conftool/dbconfig/20221122-162536-marostegui.json [16:25:53] (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38387/console" [puppet] - 10https://gerrit.wikimedia.org/r/859572 (https://phabricator.wikimedia.org/T323621) (owner: 10Clément Goubert) [16:26:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T321126)', diff saved to https://phabricator.wikimedia.org/P40647 and previous config saved to /var/cache/conftool/dbconfig/20221122-162621-marostegui.json [16:27:46] !log eevans@deploy1002 helmfile [codfw] START helmfile.d/services/sessionstore: apply [16:28:38] !log eevans@deploy1002 helmfile [codfw] DONE helmfile.d/services/sessionstore: apply [16:29:15] (03PS3) 10Clément Goubert: service::catalog: Add mw-web and mw-api-ext [puppet] - 10https://gerrit.wikimedia.org/r/859572 (https://phabricator.wikimedia.org/T323621) [16:29:50] (03CR) 10Jbond: [C: 03+2] P:swift::configure_disks: remove ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/859573 (owner: 10Jbond) [16:32:35] (03PS1) 10Filippo Giunchedi: Add new graphite hosts [deployment-charts] - 10https://gerrit.wikimedia.org/r/859575 (https://phabricator.wikimedia.org/T315524) [16:35:26] (03PS1) 10Cathal Mooney: Modify Homer config to 
ignore port speed warnings [puppet] - 10https://gerrit.wikimedia.org/r/859576 (https://phabricator.wikimedia.org/T303529) [16:35:41] (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38388/console" [puppet] - 10https://gerrit.wikimedia.org/r/859572 (https://phabricator.wikimedia.org/T323621) (owner: 10Clément Goubert) [16:39:02] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/859576 (https://phabricator.wikimedia.org/T303529) (owner: 10Cathal Mooney) [16:39:27] (03CR) 10Krinkle: prometheus: move webperf jobs to 'ext' instance (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/859566 (https://phabricator.wikimedia.org/T175087) (owner: 10Filippo Giunchedi) [16:39:59] (KubernetesAPILatency) firing: (4) High Kubernetes API latency (PATCH events) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:40:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T321130)', diff saved to https://phabricator.wikimedia.org/P40648 and previous config saved to /var/cache/conftool/dbconfig/20221122-164042-marostegui.json [16:40:44] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1135.eqiad.wmnet with reason: Maintenance [16:40:49] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [16:40:58] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1135.eqiad.wmnet with reason: Maintenance [16:41:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1135 (T321130)', diff saved to https://phabricator.wikimedia.org/P40649 and previous config saved to /var/cache/conftool/dbconfig/20221122-164104-marostegui.json [16:41:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to 
https://phabricator.wikimedia.org/P40650 and previous config saved to /var/cache/conftool/dbconfig/20221122-164128-marostegui.json [16:42:51] (03CR) 10Cathal Mooney: [C: 03+2] Modify Homer config to ignore port speed warnings [puppet] - 10https://gerrit.wikimedia.org/r/859576 (https://phabricator.wikimedia.org/T303529) (owner: 10Cathal Mooney) [16:44:15] (03PS5) 10Dzahn: rsync::quickdatacopy: Allow having multiple destination hosts [puppet] - 10https://gerrit.wikimedia.org/r/715636 (owner: 10Legoktm) [16:44:25] (03PS13) 10Btullis: Add a spark-operator chart and helmfile configuraiton [deployment-charts] - 10https://gerrit.wikimedia.org/r/855674 (https://phabricator.wikimedia.org/T318926) [16:45:19] (03CR) 10CI reject: [V: 04-1] Add a spark-operator chart and helmfile configuraiton [deployment-charts] - 10https://gerrit.wikimedia.org/r/855674 (https://phabricator.wikimedia.org/T318926) (owner: 10Btullis) [16:47:38] !log btullis@cumin1001 START - Cookbook sre.wikireplicas.add-wiki [16:48:32] !log eevans@deploy1002 helmfile [eqiad] START helmfile.d/services/sessionstore: apply [16:49:22] !log eevans@deploy1002 helmfile [eqiad] DONE helmfile.d/services/sessionstore: apply [16:51:15] 10Puppet, 10Infrastructure-Foundations: Consider alternative configuration management tooling - https://phabricator.wikimedia.org/T321874 (10Aklapper) [16:51:34] (03PS1) 10Filippo Giunchedi: [DNM] remove old graphite hosts [deployment-charts] - 10https://gerrit.wikimedia.org/r/859579 (https://phabricator.wikimedia.org/T315524) [16:52:04] (03PS6) 10Cathal Mooney: Add OSPF automation template for EVPN switches [homer/public] - 10https://gerrit.wikimedia.org/r/857482 (https://phabricator.wikimedia.org/T312635) [16:52:33] (03CR) 10Cathal Mooney: [C: 03+2] Add OSPF automation template for EVPN switches [homer/public] - 10https://gerrit.wikimedia.org/r/857482 (https://phabricator.wikimedia.org/T312635) (owner: 10Cathal Mooney) [16:53:10] (03Merged) 10jenkins-bot: Add OSPF automation template for 
EVPN switches [homer/public] - 10https://gerrit.wikimedia.org/r/857482 (https://phabricator.wikimedia.org/T312635) (owner: 10Cathal Mooney) [16:53:38] (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] prometheus: move webperf jobs to 'ext' instance (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/859566 (https://phabricator.wikimedia.org/T175087) (owner: 10Filippo Giunchedi) [16:53:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T321130)', diff saved to https://phabricator.wikimedia.org/P40651 and previous config saved to /var/cache/conftool/dbconfig/20221122-165354-marostegui.json [16:54:00] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [16:55:48] (03PS1) 10Jbond: swift::mount_filesystem: allow overriding the mount point [puppet] - 10https://gerrit.wikimedia.org/r/859581 (https://phabricator.wikimedia.org/T308677) [16:56:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P40652 and previous config saved to /var/cache/conftool/dbconfig/20221122-165634-marostegui.json [16:57:34] (03PS14) 10Btullis: Add a spark-operator chart and helmfile configuraiton [deployment-charts] - 10https://gerrit.wikimedia.org/r/855674 (https://phabricator.wikimedia.org/T318926) [16:58:16] (03CR) 10Jbond: [C: 03+2] swift::mount_filesystem: allow overriding the mount point [puppet] - 10https://gerrit.wikimedia.org/r/859581 (https://phabricator.wikimedia.org/T308677) (owner: 10Jbond) [16:58:19] (03PS1) 10Raymond Ndibe: cookbooks: print out instructions on next step after updating the buildpack/tekton images in the local repo [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/859582 (https://phabricator.wikimedia.org/T321188) [16:58:22] (03CR) 10CI reject: [V: 04-1] Add a spark-operator chart and helmfile configuraiton [deployment-charts] - 10https://gerrit.wikimedia.org/r/855674 
(https://phabricator.wikimedia.org/T318926) (owner: 10Btullis) [16:58:34] (03CR) 10Vgutierrez: [C: 03+2] node: Exclude trafficserver promfile mtime check (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/858658 (https://phabricator.wikimedia.org/T292815) (owner: 10BCornwall) [16:59:11] (03PS2) 10Stang: Update favicon and CentralAuthLoginIcon for wikifunctionswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/858717 (https://phabricator.wikimedia.org/T323627) [17:00:04] jbond and rzl: Your horoscope predicts another unfortunate Puppet request window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221122T1700). [17:00:04] No Gerrit patches in the queue for this window AFAICS. [17:03:16] (03CR) 10Cathal Mooney: Add function to int_automation to validate QFX5120 port blocks (036 comments) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/812376 (https://phabricator.wikimedia.org/T303529) (owner: 10Cathal Mooney) [17:03:40] (NodeTextfileStale) firing: (48) Stale textfile for cp1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [17:04:40] (NodeTextfileStale) firing: (48) Stale textfile for cp1076:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [17:08:40] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [17:09:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to 
https://phabricator.wikimedia.org/P40653 and previous config saved to /var/cache/conftool/dbconfig/20221122-170900-marostegui.json [17:09:30] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of kubestagetcd1006.eqiad.wmnet to drbd [17:09:40] (NodeTextfileStale) firing: (48) Stale textfile for cp1076:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [17:11:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T321126)', diff saved to https://phabricator.wikimedia.org/P40654 and previous config saved to /var/cache/conftool/dbconfig/20221122-171141-marostegui.json [17:11:43] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2171.codfw.wmnet with reason: Maintenance [17:11:45] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2171.codfw.wmnet with reason: Maintenance [17:11:47] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [17:11:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2171:3315 (T321126)', diff saved to https://phabricator.wikimedia.org/P40655 and previous config saved to /var/cache/conftool/dbconfig/20221122-171151-marostegui.json [17:12:32] !log btullis@cumin1001 Added views for new wiki: bclwikiquote T316456 [17:12:32] !log btullis@cumin1001 END (PASS) - Cookbook sre.wikireplicas.add-wiki (exit_code=0) [17:12:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1122 (re)pooling @ 10%: Maint done', diff saved to https://phabricator.wikimedia.org/P40656 and previous config saved to /var/cache/conftool/dbconfig/20221122-171235-ladsgroup.json [17:12:37] T316456: Prepare and check storage layer for bclwikiquote - https://phabricator.wikimedia.org/T316456 
[17:13:24] (03CR) 10Urbanecm: GrowthExperiments: Allow accessing NewImpact module in production (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/859545 (https://phabricator.wikimedia.org/T323526) (owner: 10Kosta Harlan) [17:13:40] (NodeTextfileStale) firing: (48) Stale textfile for cp1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [17:15:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315 (T321126)', diff saved to https://phabricator.wikimedia.org/P40657 and previous config saved to /var/cache/conftool/dbconfig/20221122-171519-marostegui.json [17:15:52] (03PS1) 10Jbond: swift: Allow for mounting using the device directly [puppet] - 10https://gerrit.wikimedia.org/r/859584 (https://phabricator.wikimedia.org/T308677) [17:16:40] 10Puppet, 10Infrastructure-Foundations: Consider alternative configuration management tooling - https://phabricator.wikimedia.org/T321874 (10bking) >>! In T321874#8413774, @Joe wrote: >>>! In T321874#8405699, @bking wrote: >> >>> I don't think there is a productive and actionable outcome of the discussion in... 
[17:17:33] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: apply config changes - bking@cumin2002 - T319020 [17:17:38] T319020: Reset to upstream java GC options and remove redundant JVM options - https://phabricator.wikimedia.org/T319020 [17:18:19] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38391/console" [puppet] - 10https://gerrit.wikimedia.org/r/859584 (https://phabricator.wikimedia.org/T308677) (owner: 10Jbond) [17:18:40] (NodeTextfileStale) firing: (48) Stale textfile for cp1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [17:18:55] (NodeTextfileStale) firing: (48) Stale textfile for cp1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [17:19:27] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of kubestagetcd1006.eqiad.wmnet to drbd [17:19:40] (NodeTextfileStale) firing: (48) Stale textfile for cp1076:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [17:21:32] (03CR) 10Ssingh: [C: 03+2] sites.yaml: add lvs4009 (ulsfo hardware refresh) [homer/public] - 10https://gerrit.wikimedia.org/r/859065 (https://phabricator.wikimedia.org/T317247) (owner: 10Ssingh) [17:22:22] PROBLEM - Ganeti memory on ganeti1023 is CRITICAL: CRIT Memory 95% used. 
Largest process: qemu-system-x86 (2179388) = 23.0% https://wikitech.wikimedia.org/wiki/Ganeti%23Memory_pressure [17:23:40] (NodeTextfileStale) firing: (40) Stale textfile for cp1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [17:24:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P40658 and previous config saved to /var/cache/conftool/dbconfig/20221122-172407-marostegui.json [17:25:17] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of kubestagetcd1006.eqiad.wmnet to plain [17:25:57] 10SRE, 10observability, 10Epic, 10Release-Engineering-Team (Radar), 10Sustainability (Incident Followup): Tracking: Monitoring and alerts for "business" metrics - https://phabricator.wikimedia.org/T140942 (10Aklapper) [17:26:46] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of kubestagetcd1006.eqiad.wmnet to plain [17:27:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1122 (re)pooling @ 25%: Maint done', diff saved to https://phabricator.wikimedia.org/P40659 and previous config saved to /var/cache/conftool/dbconfig/20221122-172740-ladsgroup.json [17:28:25] (03PS1) 10Alexandros Kosiaris: felix: Instruct felix to set the src parameter [deployment-charts] - 10https://gerrit.wikimedia.org/r/859586 [17:28:40] (NodeTextfileStale) resolved: (32) Stale textfile for cp1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [17:28:46] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38392/console" 
[puppet] - 10https://gerrit.wikimedia.org/r/859584 (https://phabricator.wikimedia.org/T308677) (owner: 10Jbond) [17:29:40] (NodeTextfileStale) resolved: (32) Stale textfile for cp1076:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [17:29:40] (03PS15) 10Btullis: Add a spark-operator chart and helmfile configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/855674 (https://phabricator.wikimedia.org/T318926) [17:29:57] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38393/console" [puppet] - 10https://gerrit.wikimedia.org/r/859584 (https://phabricator.wikimedia.org/T308677) (owner: 10Jbond) [17:30:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315', diff saved to https://phabricator.wikimedia.org/P40660 and previous config saved to /var/cache/conftool/dbconfig/20221122-173025-marostegui.json [17:30:43] !log btullis@cumin1001 START - Cookbook sre.wikireplicas.add-wiki [17:31:05] (03PS2) 10Jbond: swift: Allow for mounting using the device directly [puppet] - 10https://gerrit.wikimedia.org/r/859584 (https://phabricator.wikimedia.org/T308677) [17:31:18] (03PS2) 10Alexandros Kosiaris: felix: Instruct felix to set the src parameter [deployment-charts] - 10https://gerrit.wikimedia.org/r/859586 [17:32:06] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38394/console" [puppet] - 10https://gerrit.wikimedia.org/r/859584 (https://phabricator.wikimedia.org/T308677) (owner: 10Jbond) [17:33:40] (03CR) 10Jbond: [V: 03+1 C: 03+2] swift: Allow for mounting using the device directly [puppet] - 10https://gerrit.wikimedia.org/r/859584 (https://phabricator.wikimedia.org/T308677) (owner: 10Jbond) [17:34:59] 
(KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [17:38:11] !log brett@cumin1001 START - Cookbook sre.hosts.reimage for host cp2041.codfw.wmnet with OS bullseye [17:38:19] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin1001 for host cp2041.codfw.wmnet with OS bullseye [17:39:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T321130)', diff saved to https://phabricator.wikimedia.org/P40661 and previous config saved to /var/cache/conftool/dbconfig/20221122-173913-marostegui.json [17:39:15] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1139.eqiad.wmnet with reason: Maintenance [17:39:19] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [17:39:29] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1139.eqiad.wmnet with reason: Maintenance [17:42:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1122 (re)pooling @ 75%: Maint done', diff saved to https://phabricator.wikimedia.org/P40662 and previous config saved to /var/cache/conftool/dbconfig/20221122-174245-ladsgroup.json [17:45:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315', diff saved to https://phabricator.wikimedia.org/P40663 and previous config saved to /var/cache/conftool/dbconfig/20221122-174532-marostegui.json [17:45:53] !log btullis@cumin2002 START - Cookbook sre.presto.roll-restart-workers for Presto analytics cluster: Roll restart of all Presto's jvm daemons. 
[17:47:10] PROBLEM - Puppet CA expired certs on puppetmaster1001 is CRITICAL: CRITICAL: 6 puppet certs need to be renewed: https://wikitech.wikimedia.org/wiki/Puppet%23Renew_agent_certificate [17:48:17] (03PS1) 10Ladsgroup: mediawiki: Reduce the frequency of flaggedrevs updates [puppet] - 10https://gerrit.wikimedia.org/r/859589 (https://phabricator.wikimedia.org/T323495) [17:48:20] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:50:06] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1140.eqiad.wmnet with reason: Maintenance [17:50:20] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1140.eqiad.wmnet with reason: Maintenance [17:51:56] PROBLEM - Ganeti memory on ganeti1023 is CRITICAL: CRIT Memory 95% used. Largest process: qemu-system-x86 (2179388) = 23.0% https://wikitech.wikimedia.org/wiki/Ganeti%23Memory_pressure [17:54:12] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:54:36] RECOVERY - Check systemd state on logstash1026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:54:59] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [17:55:48] !log btullis@cumin1001 Added views for new wiki: igwikiquote T314639 [17:55:48] !log btullis@cumin1001 END (PASS) - Cookbook sre.wikireplicas.add-wiki (exit_code=0) [17:55:54] T314639: Prepare and check storage layer for igwikiquote - https://phabricator.wikimedia.org/T314639 [17:56:39] !log btullis@cumin2002 END (PASS) - Cookbook 
sre.presto.roll-restart-workers (exit_code=0) for Presto analytics cluster: Roll restart of all Presto's jvm daemons. [17:57:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1122 (re)pooling @ 100%: Maint done', diff saved to https://phabricator.wikimedia.org/P40664 and previous config saved to /var/cache/conftool/dbconfig/20221122-175750-ladsgroup.json [17:59:41] (03PS1) 10Jbond: swift: move ms-be2050 to new naming schema [puppet] - 10https://gerrit.wikimedia.org/r/859592 (https://phabricator.wikimedia.org/T308677) [18:00:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315 (T321126)', diff saved to https://phabricator.wikimedia.org/P40665 and previous config saved to /var/cache/conftool/dbconfig/20221122-180038-marostegui.json [18:00:40] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2178.codfw.wmnet with reason: Maintenance [18:00:43] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2178.codfw.wmnet with reason: Maintenance [18:00:45] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [18:00:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2178 (T321126)', diff saved to https://phabricator.wikimedia.org/P40666 and previous config saved to /var/cache/conftool/dbconfig/20221122-180049-marostegui.json [18:00:50] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1169.eqiad.wmnet with reason: Maintenance [18:00:54] (03CR) 10Jbond: [C: 04-1] "self -1 as not sure of the consequences" [puppet] - 10https://gerrit.wikimedia.org/r/859592 (https://phabricator.wikimedia.org/T308677) (owner: 10Jbond) [18:01:03] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1169.eqiad.wmnet with reason: Maintenance [18:01:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1169 
(T321130)', diff saved to https://phabricator.wikimedia.org/P40667 and previous config saved to /var/cache/conftool/dbconfig/20221122-180109-marostegui.json [18:01:15] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [18:04:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T321126)', diff saved to https://phabricator.wikimedia.org/P40668 and previous config saved to /var/cache/conftool/dbconfig/20221122-180412-marostegui.json [18:07:46] PROBLEM - SSH on db1120.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:09:59] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (PATCH events) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:11:10] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:12:21] (03PS3) 10AOkoth: vrts: add error checking [puppet] - 10https://gerrit.wikimedia.org/r/858716 (https://phabricator.wikimedia.org/T317059) [18:13:19] (03CR) 10AOkoth: vrts: add error checking (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/858716 (https://phabricator.wikimedia.org/T317059) (owner: 10AOkoth) [18:13:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T321130)', diff saved to https://phabricator.wikimedia.org/P40669 and previous config saved to /var/cache/conftool/dbconfig/20221122-181351-marostegui.json [18:13:58] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [18:14:59] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:18:04] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:19:02] (03CR) 10Dzahn: [C: 03+1] vrts: add error checking [puppet] - 10https://gerrit.wikimedia.org/r/858716 (https://phabricator.wikimedia.org/T317059) (owner: 10AOkoth) [18:19:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P40670 and previous config saved to /var/cache/conftool/dbconfig/20221122-181919-marostegui.json [18:28:49] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T319126 (10phaultfinder) [18:28:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P40671 and previous config saved to /var/cache/conftool/dbconfig/20221122-182857-marostegui.json [18:30:00] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:32:22] PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:32:48] !log installing pcre2 security updates [18:32:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:34:03] !log brett@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp2041.codfw.wmnet with OS bullseye [18:34:10] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin1001 for host cp2041.codfw.wmnet with OS bullseye executed with errors: - 
cp2041 (**FAIL**) - Downtimed on Ic... [18:34:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P40672 and previous config saved to /var/cache/conftool/dbconfig/20221122-183428-marostegui.json [18:38:30] (03PS1) 10Muehlenhoff: Add library hint for pcre2 [puppet] - 10https://gerrit.wikimedia.org/r/859594 [18:39:56] RECOVERY - Check systemd state on dbstore1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:44:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P40673 and previous config saved to /var/cache/conftool/dbconfig/20221122-184404-marostegui.json [18:44:59] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:46:29] !log cr[34]-ulsfo: set routing-options static route 198.35.26.112/28 next-hop 10.128.0.9: T317247 [18:46:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:35] T317247: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 [18:47:53] (03CR) 10Muehlenhoff: [C: 03+2] Add library hint for pcre2 [puppet] - 10https://gerrit.wikimedia.org/r/859594 (owner: 10Muehlenhoff) [18:48:27] !log decommissioning lvs4006: T317247 [18:48:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:48:38] (03CR) 10Ssingh: [C: 03+2] lvs4006: set profile::pybal::bgp to no [puppet] - 10https://gerrit.wikimedia.org/r/859086 (https://phabricator.wikimedia.org/T317247) (owner: 10Ssingh) [18:48:57] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 4:00:00 on lvs4006.ulsfo.wmnet with reason: downtimed, in the process of decom [18:49:13] !log sukhe@cumin2002 END (PASS) 
- Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on lvs4006.ulsfo.wmnet with reason: downtimed, in the process of decom [18:49:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T321126)', diff saved to https://phabricator.wikimedia.org/P40674 and previous config saved to /var/cache/conftool/dbconfig/20221122-184934-marostegui.json [18:49:40] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [18:51:28] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:52:12] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:52:18] ^ expected [18:52:47] (03PS2) 10Muehlenhoff: webperf: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/858605 (https://phabricator.wikimedia.org/T308013) [18:56:35] (03CR) 10Dzahn: [C: 03+1] webperf: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/858605 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [18:58:30] (03PS1) 10Ssingh: lvs4009: set as high-traffic2 primary LVS and remove lvs4006 (decomm) [puppet] - 10https://gerrit.wikimedia.org/r/859598 (https://phabricator.wikimedia.org/T317247) [18:59:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T321130)', diff saved to https://phabricator.wikimedia.org/P40675 and previous config saved to /var/cache/conftool/dbconfig/20221122-185910-marostegui.json [18:59:12] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1184.eqiad.wmnet with reason: Maintenance [18:59:17] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [18:59:36] (03CR) 10Muehlenhoff: [C: 03+2] webperf: Add SPDX headers [puppet] - 
10https://gerrit.wikimedia.org/r/858605 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [18:59:37] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1184.eqiad.wmnet with reason: Maintenance [18:59:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1184 (T321130)', diff saved to https://phabricator.wikimedia.org/P40676 and previous config saved to /var/cache/conftool/dbconfig/20221122-185943-marostegui.json [18:59:59] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:00:33] (03PS1) 10Ssingh: sites.yaml: remove decommissioned host lvs4006 [homer/public] - 10https://gerrit.wikimedia.org/r/859600 (https://phabricator.wikimedia.org/T317247) [19:01:49] 10SRE, 10ops-eqiad, 10Traffic: Host lvs1014.mgmt is down - https://phabricator.wikimedia.org/T322933 (10wiki_willy) a:03Jclark-ctr [19:02:22] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Decommission mw13[07-48] - https://phabricator.wikimedia.org/T306162 (10wiki_willy) a:03Jclark-ctr [19:04:02] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): Q1:rerack elastic10[53-67] - https://phabricator.wikimedia.org/T322082 (10wiki_willy) a:05Cmjohnson→03Jclark-ctr [19:07:40] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [19:08:42] RECOVERY - SSH on db1120.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook 
[19:09:36] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [19:13:17] !log sukhe@cumin2002 START - Cookbook sre.hosts.decommission for hosts lvs4006.ulsfo.wmnet [19:13:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T321130)', diff saved to https://phabricator.wikimedia.org/P40677 and previous config saved to /var/cache/conftool/dbconfig/20221122-191337-marostegui.json [19:13:43] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [19:17:20] !log sukhe@cumin2002 START - Cookbook sre.dns.netbox [19:18:53] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp2041.codfw.wmnet with OS bullseye [19:19:00] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp2041.codfw.wmnet with OS bullseye [19:19:42] !log sukhe@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:19:43] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts lvs4006.ulsfo.wmnet [19:19:51] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by sukhe@cumin2002 for hosts: `lvs4006.ulsfo.wmnet` - lvs4006.ulsfo.wmnet (**WARN**) - D... 
[19:21:19] (03CR) 10Ssingh: [C: 03+2] lvs4009: set as high-traffic2 primary LVS and remove lvs4006 (decomm) [puppet] - 10https://gerrit.wikimedia.org/r/859598 (https://phabricator.wikimedia.org/T317247) (owner: 10Ssingh) [19:21:42] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:22:24] PROBLEM - Ganeti memory on ganeti1023 is CRITICAL: CRIT Memory 95% used. Largest process: qemu-system-x86 (2179388) = 23.0% https://wikitech.wikimedia.org/wiki/Ganeti%23Memory_pressure [19:22:58] (03CR) 10Ssingh: [V: 03+2 C: 03+2] "Forgive me for merging this without review but it's a removal of a host that was decommissioned and it will alert otherwise!" [homer/public] - 10https://gerrit.wikimedia.org/r/859600 (https://phabricator.wikimedia.org/T317247) (owner: 10Ssingh) [19:24:08] (03Merged) 10jenkins-bot: sites.yaml: remove decommissioned host lvs4006 [homer/public] - 10https://gerrit.wikimedia.org/r/859600 (https://phabricator.wikimedia.org/T317247) (owner: 10Ssingh) [19:24:16] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T319126 (10phaultfinder) [19:24:51] !log running homer for Gerrit 859600: lvs4006 decommission [19:24:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:28:04] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp2041.codfw.wmnet with OS bullseye [19:28:13] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp2041.codfw.wmnet with OS bullseye executed with errors: - cp2041 (**FAIL**) - Removed from Pu... 
[19:28:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P40678 and previous config saved to /var/cache/conftool/dbconfig/20221122-192844-marostegui.json [19:32:42] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:42:33] !log sukhe@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2041.codfw.wmnet'] [19:42:47] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp2041.codfw.wmnet'] [19:43:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P40679 and previous config saved to /var/cache/conftool/dbconfig/20221122-194350-marostegui.json [19:44:59] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:46:21] !log sukhe@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2041.codfw.wmnet'] [19:46:28] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp2041.codfw.wmnet'] [19:47:26] !log sukhe@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2041.codfw.wmnet'] [19:47:30] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp2041.codfw.wmnet'] [19:47:36] hmmm [19:49:24] (03PS1) 10Jbond: install_server: Add dynamic raid configuration [puppet] - 10https://gerrit.wikimedia.org/r/859607 (https://phabricator.wikimedia.org/T308677) [19:49:55] (03CR) 10AOkoth: [C: 03+2] vrts: add error checking [puppet] - 
10https://gerrit.wikimedia.org/r/858716 (https://phabricator.wikimedia.org/T317059) (owner: 10AOkoth) [19:50:06] (03CR) 10CI reject: [V: 04-1] install_server: Add dynamic raid configuration [puppet] - 10https://gerrit.wikimedia.org/r/859607 (https://phabricator.wikimedia.org/T308677) (owner: 10Jbond) [19:50:31] !log sukhe@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2041.codfw.wmnet'] [19:50:38] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp2041.codfw.wmnet'] [19:50:42] (03PS2) 10Jbond: install_server: Add dynamic raid configuration [puppet] - 10https://gerrit.wikimedia.org/r/859607 (https://phabricator.wikimedia.org/T308677) [19:50:57] (03PS3) 10Jbond: install_server: Add dynamic raid configuration [puppet] - 10https://gerrit.wikimedia.org/r/859607 (https://phabricator.wikimedia.org/T308677) [19:51:39] (03CR) 10CI reject: [V: 04-1] install_server: Add dynamic raid configuration [puppet] - 10https://gerrit.wikimedia.org/r/859607 (https://phabricator.wikimedia.org/T308677) (owner: 10Jbond) [19:51:40] HTTPS interface it is then for now :) [19:51:46] 10SRE, 10ops-eqiad, 10Cassandra, 10DC-Ops, 10User-Eevans: Relocate hosts: aqs10[3-5] - https://phabricator.wikimedia.org/T307035 (10wiki_willy) a:05Cmjohnson→03Jclark-ctr [19:51:52] PROBLEM - Ganeti memory on ganeti1023 is CRITICAL: CRIT Memory 95% used. 
Largest process: qemu-system-x86 (2179388) = 23.0% https://wikitech.wikimedia.org/wiki/Ganeti%23Memory_pressure [19:53:51] !log brett@cumin1001 START - Cookbook sre.hosts.reimage for host cp2041.codfw.wmnet with OS bullseye [19:54:03] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin1001 for host cp2041.codfw.wmnet with OS bullseye [19:54:20] (03CR) 10Jbond: install_server: Add dynamic raid configuration (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/859607 (https://phabricator.wikimedia.org/T308677) (owner: 10Jbond) [19:55:16] (03PS4) 10Jbond: install_server: Add dynamic raid configuration [puppet] - 10https://gerrit.wikimedia.org/r/859607 (https://phabricator.wikimedia.org/T308677) [19:58:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T321130)', diff saved to https://phabricator.wikimedia.org/P40680 and previous config saved to /var/cache/conftool/dbconfig/20221122-195857-marostegui.json [19:58:59] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1186.eqiad.wmnet with reason: Maintenance [19:59:03] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [19:59:18] !log sukhe@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2041'] [19:59:23] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1186.eqiad.wmnet with reason: Maintenance [19:59:24] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp2041'] [19:59:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1186 (T321130)', diff saved to https://phabricator.wikimedia.org/P40681 and previous config saved to /var/cache/conftool/dbconfig/20221122-195929-marostegui.json 
[19:59:54] 10SRE, 10ops-eqiad, 10DC-Ops, 10Shared-Data-Infrastructure: Attempt to move some GPUs from Hadoop to the DSE-K8S cluster - https://phabricator.wikimedia.org/T318696 (10wiki_willy) a:03Jclark-ctr [20:02:09] 10SRE, 10ops-eqiad, 10DC-Ops: Q2:rack/setup/install puppetdb1003 - https://phabricator.wikimedia.org/T317892 (10wiki_willy) Hi @MoritzMuehlenhoff - thanks for the heads up on IRC. @Papaul will be taking a look at the host, to wrap up the installation by the end of the week. Thanks, Willy [20:03:26] !log sukhe@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2041'] [20:03:31] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp2041'] [20:03:50] !log sukhe@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2041.codfw.wmnet'] [20:03:53] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp2041.codfw.wmnet'] [20:04:20] !log sukhe@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2041.codfw.wmnet'] [20:04:24] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp2041.codfw.wmnet'] [20:04:44] !log sukhe@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2041'] [20:04:47] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp2041'] [20:04:50] !log sukhe@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2041'] [20:04:55] !log brett@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp2041.codfw.wmnet with OS bullseye [20:05:00] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp2041'] [20:05:05] 10SRE, 10Traffic, 
10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin1001 for host cp2041.codfw.wmnet with OS bullseye executed with errors: - cp2041 (**FAIL**) - Removed from Pu... [20:05:34] !log sukhe@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2041'] [20:05:41] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp2041'] [20:07:22] !log sudo ipmitool -I lanplus -H "cp2041.mgmt.codfw.wmnet" -U root -E chassis power cycle [20:07:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:11:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T321130)', diff saved to https://phabricator.wikimedia.org/P40682 and previous config saved to /var/cache/conftool/dbconfig/20221122-201140-marostegui.json [20:11:46] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [20:16:26] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:19:31] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [20:19:50] (03PS1) 10Stevemunene: Allow introspection for production environment [puppet] - 10https://gerrit.wikimedia.org/r/859610 (https://phabricator.wikimedia.org/T308778) [20:19:59] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:20:28] (03CR) 10CI reject: [V: 04-1] Allow introspection for production environment [puppet] - 10https://gerrit.wikimedia.org/r/859610 (https://phabricator.wikimedia.org/T308778) (owner: 10Stevemunene) [20:21:58] !log 
pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:23:41] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host puppetdb1003.mgmt.eqiad.wmnet with reboot policy FORCED [20:25:49] (03PS2) 10Stevemunene: Allow introspection for production environment [puppet] - 10https://gerrit.wikimedia.org/r/859610 (https://phabricator.wikimedia.org/T308778) [20:26:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1186', diff saved to https://phabricator.wikimedia.org/P40683 and previous config saved to /var/cache/conftool/dbconfig/20221122-202646-marostegui.json [20:32:48] hello! just to note i have 2 patches for the deployment window in 30 min, but i have to step away for the next hour, so i will be back 30 min after the deployment window starts [20:33:23] sorry, i hope its not too inconvenient to the deployer! s [20:36:46] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host puppetdb1003.mgmt.eqiad.wmnet with reboot policy FORCED [20:41:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1186', diff saved to https://phabricator.wikimedia.org/P40684 and previous config saved to /var/cache/conftool/dbconfig/20221122-204153-marostegui.json [20:48:16] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['puppetdb1003'] [20:52:30] PROBLEM - Ganeti memory on ganeti1023 is CRITICAL: CRIT Memory 95% used. 
Largest process: qemu-system-x86 (2179388) = 23.0% https://wikitech.wikimedia.org/wiki/Ganeti%23Memory_pressure [20:54:53] (03PS3) 10Samtar: Update favicon and CentralAuthLoginIcon for wikifunctionswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/858717 (https://phabricator.wikimedia.org/T323627) (owner: 10Stang) [20:55:02] (03PS3) 10Samtar: zhwiki: Install PageTriage on Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/858705 (https://phabricator.wikimedia.org/T323378) (owner: 10Stang) [20:57:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T321130)', diff saved to https://phabricator.wikimedia.org/P40685 and previous config saved to /var/cache/conftool/dbconfig/20221122-205659-marostegui.json [20:57:01] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1196.eqiad.wmnet with reason: Maintenance [20:57:06] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [20:57:14] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1196.eqiad.wmnet with reason: Maintenance [20:57:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1196 (T321130)', diff saved to https://phabricator.wikimedia.org/P40686 and previous config saved to /var/cache/conftool/dbconfig/20221122-205720-marostegui.json [20:57:42] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['puppetdb1003'] [20:58:07] 10SRE, 10ops-eqiad, 10DC-Ops: Q2:rack/setup/install puppetdb1003 - https://phabricator.wikimedia.org/T317892 (10Papaul) [20:58:22] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['puppetdb1003'] [20:58:55] (03CR) 10Btullis: [C: 03+1] "Great! Thanks Steve." 
[puppet] - 10https://gerrit.wikimedia.org/r/859610 (https://phabricator.wikimedia.org/T308778) (owner: 10Stevemunene) [20:59:05] 10SRE, 10Fundraising-Backlog, 10Traffic-Icebox, 10fr-donorservices, and 4 others: SSL cert for links.email.wikimedia.org - https://phabricator.wikimedia.org/T188561 (10XenoRyet) [21:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: How many deployers does it take to do UTC late backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221122T2100). [21:00:04] bwang and cirno: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:14] I can deploy :) [21:00:31] o/ [21:00:58] cirno: I'm going to start with your 858717, then run the maintenance script for 858705 [21:01:11] well please wait [21:01:20] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/858717 (https://phabricator.wikimedia.org/T323627) (owner: 10Stang) [21:01:33] cirno: stop? ^ [21:01:44] I removed the logo one as maybe put it for a while is better [21:01:50] !log samtar@deploy1002 backport aborted: (duration: 00m 33s) [21:02:02] (03Merged) 10jenkins-bot: Update favicon and CentralAuthLoginIcon for wikifunctionswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/858717 (https://phabricator.wikimedia.org/T323627) (owner: 10Stang) [21:02:35] just refresh the latest calendar :) [21:02:51] so maybe revert this one? 
not sure what to do [21:02:56] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host dbprov1004.mgmt.eqiad.wmnet with reboot policy FORCED [21:02:57] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dbprov1004.mgmt.eqiad.wmnet with reboot policy FORCED [21:03:10] ack [21:03:30] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host dbprov1004.mgmt.eqiad.wmnet with reboot policy FORCED [21:03:32] (03PS1) 10Samtar: Revert "Update favicon and CentralAuthLoginIcon for wikifunctionswiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/859509 [21:03:57] (03CR) 10Samtar: [C: 03+2] "Reverting" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/859509 (owner: 10Samtar) [21:04:10] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['puppetdb1003'] [21:04:44] (03Merged) 10jenkins-bot: Revert "Update favicon and CentralAuthLoginIcon for wikifunctionswiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/859509 (owner: 10Samtar) [21:04:57] cirno: okay, reverted :) [21:05:26] I'll try the maintenance script for 858705 now [21:05:38] thanks, left a message on the relevant task :) [21:06:50] cirno: that seems to have worked as expected :) [21:07:05] TheresNoTime: do beta cluster support wikimediadebug or not? 
So I can access mwdebug1001 during the deploy [21:07:17] cirno: it does not afaik [21:07:24] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/858705 (https://phabricator.wikimedia.org/T323378) (owner: 10Stang) [21:08:06] (03Merged) 10jenkins-bot: zhwiki: Install PageTriage on Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/858705 (https://phabricator.wikimedia.org/T323378) (owner: 10Stang) [21:08:40] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [21:10:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1196 (T321130)', diff saved to https://phabricator.wikimedia.org/P40687 and previous config saved to /var/cache/conftool/dbconfig/20221122-211049-marostegui.json [21:10:55] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [21:10:56] (03PS1) 10Stang: Revert "Revert "Update favicon and CentralAuthLoginIcon for wikifunctionswiki"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/859510 (https://phabricator.wikimedia.org/T323627) [21:11:24] (03PS2) 10Stang: Revert "Revert "Update favicon and CentralAuthLoginIcon for wikifunctionswiki"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/859510 (https://phabricator.wikimedia.org/T323627) [21:11:47] cirno: just waiting for `beta-code-update-eqiad` to finish, then that'll hopefully be live on beta [21:11:54] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:12:30] s/`beta-code-update-eqiad`/`beta-scap-sync-world` [21:12:58] TheresNoTime: hi! 
thanks for deploying -- bwang mentioned to me earlier today that he'll be available about halfway thru this backport window - he should be around in 15 [21:13:42] cjming: no worries, they also left a message on the calendar which I saw :) [21:13:55] cirno: that's live on beta, and looking at https://zh.wikipedia.beta.wmflabs.org/wiki/Special:%E7%89%88%E6%9C%AC it seems to be enabled at least? [21:14:21] looking [21:16:18] https://zh.wikipedia.beta.wmflabs.org/wiki/Special:%E6%96%B0%E9%A1%B5%E9%9D%A2%E4%BE%9B%E7%BB%99 loads, not sure if no results is expected to be honest.. [21:16:23] yeah I could see PageTriage appeared in Special:version, but there's something weird like https://zh.wikipedia.beta.wmflabs.org/wiki/Special:%E6%96%B0%E9%A1%B5%E9%9D%A2%E4%BE%9B%E7%BB%99 contains nothing... is it something expected [21:17:14] I'm creating a new article with alt account to test [21:20:22] I created a new page called https://zh.wikipedia.beta.wmflabs.org/wiki/12345, and the pagetriage tool on the right hand side appears, so LGTM! [21:20:52] ack, noting T323647 [21:20:52] T323647: PHP Notice: Undefined index: afc_state - https://phabricator.wikimedia.org/T323647 [21:22:18] PROBLEM - Ganeti memory on ganeti1023 is CRITICAL: CRIT Memory 95% used. Largest process: qemu-system-x86 (2179388) = 23.0% https://wikitech.wikimedia.org/wiki/Ganeti%23Memory_pressure [21:23:11] guess that's somewhat expected [21:25:54] bwang: lemme know when you're about for your patches :) [21:25:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1196', diff saved to https://phabricator.wikimedia.org/P40688 and previous config saved to /var/cache/conftool/dbconfig/20221122-212556-marostegui.json [21:29:23] 10SRE, 10observability, 10serviceops, 10Patch-For-Review, and 2 others: Create an alert for high memcached bw usage - https://phabricator.wikimedia.org/T224454 (10jijiki) >>! 
In T224454#8411988, @elukey wrote: > An optional (but in my opinion useful) alert could be related to a prolonged usage of the gutte... [21:30:00] (03CR) 10Samtar: [C: 03+2] "starting deploy" [skins/Vector] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/859508 (https://phabricator.wikimedia.org/T323176) (owner: 10Bernard Wang) [21:31:47] (03CR) 10Samtar: [C: 03+2] "starting deploy" [skins/Vector] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/859076 (https://phabricator.wikimedia.org/T317897) (owner: 10Jdlrobson) [21:32:31] TheresNoTime: hi i'm back and ready! [21:32:42] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host dbprov1004.mgmt.eqiad.wmnet with reboot policy FORCED [21:33:01] bwang: hi! I've just started off the patches merging, seeing as they take ~10 minutes [21:33:25] gotcha, just lmk! [21:33:25] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['dbprov1004'] [21:35:53] 10SRE, 10serviceops, 10good first task: Upgrade all deployment charts to use the latest version of common_templates - https://phabricator.wikimedia.org/T292390 (10Aklapper) [21:41:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1196', diff saved to https://phabricator.wikimedia.org/P40689 and previous config saved to /var/cache/conftool/dbconfig/20221122-214103-marostegui.json [21:43:53] (03Merged) 10jenkins-bot: Fix icon button spacing in sticky header [skins/Vector] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/859508 (https://phabricator.wikimedia.org/T323176) (owner: 10Bernard Wang) [21:44:02] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [skins/Vector] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/859508 (https://phabricator.wikimedia.org/T323176) (owner: 10Bernard Wang) [21:44:17] !log samtar@deploy1002 Started scap: Backport for [[gerrit:859508|Fix icon button spacing in sticky header 
(T323176)]] [21:44:18] bwang: starting with 859508 :) [21:44:22] T323176: [S] Sticky header icon buttons are missing padding - https://phabricator.wikimedia.org/T323176 [21:44:39] !log samtar@deploy1002 samtar and bwang: Backport for [[gerrit:859508|Fix icon button spacing in sticky header (T323176)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet [21:44:51] bwang: that's live on mwdebug now, can you test? [21:45:14] yep, but which one? [21:45:29] (03Merged) 10jenkins-bot: Update TOC to use PinnableHeader [skins/Vector] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/859076 (https://phabricator.wikimedia.org/T317897) (owner: 10Jdlrobson) [21:45:31] use mwdebug1001 :) [21:45:31] or does it not matter which number [21:45:38] (doesn't matter afaik) [21:47:25] great the first patch looks good [21:47:33] syncing that patch now [21:48:59] is the second one also ready to test? [21:49:34] bwang: not yet, be about ~5 minutes :) [21:49:46] 👍 [21:49:59] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:51:42] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:859508|Fix icon button spacing in sticky header (T323176)]] (duration: 07m 25s) [21:51:47] T323176: [S] Sticky header icon buttons are missing padding - https://phabricator.wikimedia.org/T323176 [21:51:49] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [skins/Vector] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/859076 (https://phabricator.wikimedia.org/T317897) (owner: 10Jdlrobson) [21:52:03] !log samtar@deploy1002 Started scap: Backport for [[gerrit:859076|Update TOC to use PinnableHeader (T317897)]] [21:52:08] T317897: [L] [Page Tools] Make the page tools menu pinnable - https://phabricator.wikimedia.org/T317897 
[21:52:12] PROBLEM - Ganeti memory on ganeti1023 is CRITICAL: CRIT Memory 95% used. Largest process: qemu-system-x86 (2179388) = 23.0% https://wikitech.wikimedia.org/wiki/Ganeti%23Memory_pressure [21:52:24] !log samtar@deploy1002 samtar and jdlrobson: Backport for [[gerrit:859076|Update TOC to use PinnableHeader (T317897)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet [21:52:27] bwang: 859508 should be live everywhere now, and 859076 is available to test on mwdebug1001 [21:54:00] 859076 looks good too! [21:54:07] syncin'! [21:56:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1196 (T321130)', diff saved to https://phabricator.wikimedia.org/P40690 and previous config saved to /var/cache/conftool/dbconfig/20221122-215610-marostegui.json [21:56:12] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [21:56:14] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [21:56:15] thanks!! [21:56:16] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [21:58:15] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:859076|Update TOC to use PinnableHeader (T317897)]] (duration: 06m 11s) [21:58:18] np! 
both patches should be live everywhere now :) [21:58:20] T317897: [L] [Page Tools] Make the page tools menu pinnable - https://phabricator.wikimedia.org/T317897 [21:58:48] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['dbprov1004'] [21:59:56] !log close UTC late backport window [22:00:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:06:43] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2097.codfw.wmnet with reason: Maintenance [22:06:56] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2097.codfw.wmnet with reason: Maintenance [22:14:59] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [22:15:23] !log phab1004 - rsyncing /srv/repos from phab1001 with 2Mbit bwlimit [22:15:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:17:03] !log phab1004 - rsyncing /srv/repos from phab1001 with 2Mbit bwlimit - pulling - rsync -avp --bwlimit=2m --delete rsync://phab1001.eqiad.wmnet/srv-repos/ /srv/repos/ - T280597 [22:17:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:17:09] T280597: move phabricator to new hardware generation - https://phabricator.wikimedia.org/T280597 [22:17:57] 10SRE, 10ops-eqiad, 10DC-Ops: Q2:rack/setup/install puppetdb1003 - https://phabricator.wikimedia.org/T317892 (10Papaul) [22:19:16] 10SRE, 10ops-codfw: Troubleshoot why latest idrac version is not working on Dell servers - https://phabricator.wikimedia.org/T322419 (10Papaul) 05Open→03Resolved This is now fixed by @jbond and @Volans [22:19:29] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2102.codfw.wmnet with reason: Maintenance [22:19:53] !log marostegui@cumin1001 END 
(PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2102.codfw.wmnet with reason: Maintenance [22:22:08] PROBLEM - Ganeti memory on ganeti1023 is CRITICAL: CRIT Memory 95% used. Largest process: qemu-system-x86 (2179388) = 23.0% https://wikitech.wikimedia.org/wiki/Ganeti%23Memory_pressure [22:24:03] !log temp disabling puppet on 17 hosts using rsync::quickdatacopy to carefully deploy gerrit:715636 allowing multiple dest hosts for syncing [22:24:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:24:16] SRE, ops-eqiad, DC-Ops, Data-Persistence-Backup: Q2:rack/setup/install dbprov1004 - https://phabricator.wikimedia.org/T321122 (Papaul) [22:30:05] (PS1) Papaul: Add puppetdb1003 and dbprov1004 to site.pp and netboox.cfg [puppet] - https://gerrit.wikimedia.org/r/859625 (https://phabricator.wikimedia.org/T317892) [22:30:17] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2103.codfw.wmnet with reason: Maintenance [22:30:24] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['dbprov1004'] [22:30:41] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2103.codfw.wmnet with reason: Maintenance [22:30:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2103 (T321130)', diff saved to https://phabricator.wikimedia.org/P40691 and previous config saved to /var/cache/conftool/dbconfig/20221122-223047-marostegui.json [22:30:53] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [22:30:56] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['dbprov1004'] [22:31:44] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['dbprov1004'] [22:32:24] (CR) Papaul: [C: +2] Add puppetdb1003 and dbprov1004 to site.pp and netboox.cfg
[puppet] - https://gerrit.wikimedia.org/r/859625 (https://phabricator.wikimedia.org/T317892) (owner: Papaul) [22:34:31] !log phabricator: on phab1001 user 'phd' is UID 497, on phab1004 user 'phd' is UID 920 (this is desired and a fix!) - but also..because uid 497 was now free.. it became the UID of user 'vcs' on phab1004 while on phab1001 user 'vcs' is uid 498. so we use "find /srv/repos -uid 497 -exec chown phd {} \;" to give files owned by 497 to phd. T280597 [22:34:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:34:36] T280597: move phabricator to new hardware generation - https://phabricator.wikimedia.org/T280597 [22:36:40] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host puppetdb1003.eqiad.wmnet with OS bullseye [22:36:47] SRE, ops-eqiad, DC-Ops, Patch-For-Review: Q2:rack/setup/install puppetdb1003 - https://phabricator.wikimedia.org/T317892 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host puppetdb1003.eqiad.wmnet with OS bullseye [22:37:00] SRE, ops-eqiad, DC-Ops, Patch-For-Review: Q2:rack/setup/install puppetdb1003 - https://phabricator.wikimedia.org/T317892 (Papaul) [22:37:15] SRE, ops-eqiad, DC-Ops, Data-Persistence-Backup, Patch-For-Review: Q2:rack/setup/install dbprov1004 - https://phabricator.wikimedia.org/T321122 (Papaul) [22:37:58] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['dbprov1004'] [22:38:50] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['dbprov1004'] [22:39:59] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [22:43:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2103 (T321130)', diff saved to
https://phabricator.wikimedia.org/P40692 and previous config saved to /var/cache/conftool/dbconfig/20221122-224321-marostegui.json [22:43:28] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [22:44:59] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [22:48:40] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on puppetdb1003.eqiad.wmnet with reason: host reimage [22:52:20] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on puppetdb1003.eqiad.wmnet with reason: host reimage [22:54:59] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [22:58:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2103', diff saved to https://phabricator.wikimedia.org/P40693 and previous config saved to /var/cache/conftool/dbconfig/20221122-225828-marostegui.json [22:59:22] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['dbprov1004'] [22:59:59] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [23:02:42] (CR) Dzahn: [V: +1 C: +2] "compiled on all 17 hosts that use this (list to paste into compiler from cumin command: sudo cumin --no-colors 'R:rsync::quickdatacopy' 2>" [puppet] - https://gerrit.wikimedia.org/r/715636 (owner: Legoktm) [23:06:54] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host puppetdb1003.eqiad.wmnet with OS bullseye [23:07:01] SRE, ops-eqiad,
DC-Ops, Patch-For-Review: Q2:rack/setup/install puppetdb1003 - https://phabricator.wikimedia.org/T317892 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host puppetdb1003.eqiad.wmnet with OS bullseye completed: - puppetdb1003 (**PASS**)... [23:11:11] SRE, ops-eqiad, DC-Ops, Patch-For-Review: Q2:rack/setup/install puppetdb1003 - https://phabricator.wikimedia.org/T317892 (Papaul) [23:11:30] (CR) Dzahn: "change for multiple dest hosts - merged and deployed - unblocking you" [puppet] - https://gerrit.wikimedia.org/r/715638 (https://phabricator.wikimedia.org/T289857) (owner: Legoktm) [23:12:41] SRE, ops-eqiad, DC-Ops, Patch-For-Review: Q2:rack/setup/install puppetdb1003 - https://phabricator.wikimedia.org/T317892 (Papaul) Open→Resolved @MoritzMuehlenhoff this complete [23:13:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2103', diff saved to https://phabricator.wikimedia.org/P40694 and previous config saved to /var/cache/conftool/dbconfig/20221122-231334-marostegui.json [23:13:42] (PS2) Dzahn: phabricator: set mysql master port for eqiad [puppet] - https://gerrit.wikimedia.org/r/859145 (https://phabricator.wikimedia.org/T280597) [23:16:59] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host dbprov1004.eqiad.wmnet with OS bullseye [23:17:06] SRE, ops-eqiad, DC-Ops, Data-Persistence-Backup, Patch-For-Review: Q2:rack/setup/install dbprov1004 - https://phabricator.wikimedia.org/T321122 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host dbprov1004.eqiad.wmnet with OS bullseye [23:17:38] (PS1) Dzahn: phabricator: let phd run on phab1004 [puppet] - https://gerrit.wikimedia.org/r/859628 (https://phabricator.wikimedia.org/T280597) [23:24:01] (PS1) Dzahn: phabricator: move some more settings from host file to common [puppet] - https://gerrit.wikimedia.org/r/859631
(https://phabricator.wikimedia.org/T280597) [23:28:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2103 (T321130)', diff saved to https://phabricator.wikimedia.org/P40695 and previous config saved to /var/cache/conftool/dbconfig/20221122-232841-marostegui.json [23:28:43] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2116.codfw.wmnet with reason: Maintenance [23:28:47] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [23:28:57] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2116.codfw.wmnet with reason: Maintenance [23:29:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2116 (T321130)', diff saved to https://phabricator.wikimedia.org/P40696 and previous config saved to /var/cache/conftool/dbconfig/20221122-232903-marostegui.json [23:41:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2116 (T321130)', diff saved to https://phabricator.wikimedia.org/P40697 and previous config saved to /var/cache/conftool/dbconfig/20221122-234134-marostegui.json [23:41:41] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [23:44:59] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [23:50:11] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on dbprov1004.eqiad.wmnet with reason: host reimage [23:53:21] (PS1) Daimona Eaytoy: Create list of users who can test the CampaignEvents extension on meta [mediawiki-config] - https://gerrit.wikimedia.org/r/859634 (https://phabricator.wikimedia.org/T316227) [23:53:37] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbprov1004.eqiad.wmnet with reason:
host reimage [23:54:59] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [23:56:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2116', diff saved to https://phabricator.wikimedia.org/P40698 and previous config saved to /var/cache/conftool/dbconfig/20221122-235641-marostegui.json [23:57:01] (PS1) Daimona Eaytoy: Configure the CampaignEvents ext to use the x1.wikishared db for meta [mediawiki-config] - https://gerrit.wikimedia.org/r/859635 (https://phabricator.wikimedia.org/T322745) [23:58:16] (PS1) Daimona Eaytoy: Enable the CampaignEvents extension on meta [mediawiki-config] - https://gerrit.wikimedia.org/r/859636 (https://phabricator.wikimedia.org/T322745)