[00:00:06] <icinga-wm>	 RECOVERY - Check systemd state on puppetmaster2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:06:10] <icinga-wm>	 PROBLEM - Check systemd state on puppetmaster2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump_cloud_ip_ranges.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:06:39] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315 (T323214)', diff saved to https://phabricator.wikimedia.org/P40405 and previous config saved to /var/cache/conftool/dbconfig/20221122-000638-ladsgroup.json
[00:06:40] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2157.codfw.wmnet with reason: Maintenance
[00:06:45] <stashbot>	 T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214
[00:06:54] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2157.codfw.wmnet with reason: Maintenance
[00:07:00] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2157 (T323214)', diff saved to https://phabricator.wikimedia.org/P40406 and previous config saved to /var/cache/conftool/dbconfig/20221122-000700-ladsgroup.json
[00:07:39] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P40407 and previous config saved to /var/cache/conftool/dbconfig/20221122-000739-ladsgroup.json
[00:09:05] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P40408 and previous config saved to /var/cache/conftool/dbconfig/20221122-000904-ladsgroup.json
[00:19:27] <wikibugs>	 (03PS3) 10BCornwall: node: Exclude trafficserver promfile mtime check [alerts] - 10https://gerrit.wikimedia.org/r/858658
[00:21:32] <icinga-wm>	 PROBLEM - PHD should be running on phab1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args php ./phd-daemon, UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[00:22:01] <jinxer-wm>	 (ProbeDown) firing: (2) Service phab1001:443 has failed probes (http_phabricator_wikimedia_org_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#phab1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:22:44] <rzl>	 hello
[00:22:46] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T322618)', diff saved to https://phabricator.wikimedia.org/P40409 and previous config saved to /var/cache/conftool/dbconfig/20221122-002245-ladsgroup.json
[00:22:47] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[00:22:52] <stashbot>	 T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618
[00:23:01] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[00:23:40] <rzl>	 mutante: phab1001 is the old host, right? so that alert is bogus?
[00:24:11] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T322618)', diff saved to https://phabricator.wikimedia.org/P40410 and previous config saved to /var/cache/conftool/dbconfig/20221122-002411-ladsgroup.json
[00:24:13] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[00:24:15] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[00:28:52] <rzl>	 oh I see, it's an expired silence
[00:31:05] <logmsgbot>	 !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 14 days, 0:00:00 on phab1001.eqiad.wmnet with reason: T322250
[00:31:11] <stashbot>	 T322250: decom phab2001 (service owner) - https://phabricator.wikimedia.org/T322250
[00:31:20] <mutante>	 rzl: yes, sorry. fixed. I thought 2 hours was plenty. it was not
[00:31:21] <logmsgbot>	 !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on phab1001.eqiad.wmnet with reason: T322250
[00:31:23] <rzl>	 was just about to ask if I can do that, thanks :)
[00:31:45] <rzl>	 no worries! thanks for the work
[00:32:10] <rzl>	 resolving in VO
[00:33:04] <mutante>	 thanks
[00:41:34] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 227, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:41:52] <icinga-wm>	 RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 90, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:47:14] <icinga-wm>	 PROBLEM - Check systemd state on logstash1026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:08:40] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[01:13:36] <logmsgbot>	 !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on phab1004.eqiad.wmnet with reason: T322250
[01:13:42] <stashbot>	 T322250: decom phab2001 (service owner) - https://phabricator.wikimedia.org/T322250
[01:13:51] <logmsgbot>	 !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on phab1004.eqiad.wmnet with reason: T322250
[01:14:05] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T323214)', diff saved to https://phabricator.wikimedia.org/P40411 and previous config saved to /var/cache/conftool/dbconfig/20221122-011404-ladsgroup.json
[01:14:10] <stashbot>	 T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214
[01:16:21] <wikibugs>	 (03PS1) 10Dzahn: Revert "phabricator: switch from phab1001 to phab1004, discovery and SPF" [dns] - 10https://gerrit.wikimedia.org/r/859077
[01:16:56] <wikibugs>	 (03PS1) 10Dzahn: Revert "hieradata: switch active Phabricator server to phab1004" [puppet] - 10https://gerrit.wikimedia.org/r/859078
[01:19:59] <jinxer-wm>	 (KubernetesAPILatency) firing: (13) High Kubernetes API latency (PATCH events) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[01:24:58] <jinxer-wm>	 (KubernetesAPILatency) firing: (13) High Kubernetes API latency (PATCH events) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[01:25:03] <brennen>	 !log reverting to phab1001; short phabricator downtime incoming while DNS changes are made (T280597)
[01:25:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:25:09] <stashbot>	 T280597: move phabricator to new hardware generation - https://phabricator.wikimedia.org/T280597
[01:26:45] <logmsgbot>	 !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on phab1004.eqiad.wmnet with reason: T322250
[01:26:48] <logmsgbot>	 !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on phab1004.eqiad.wmnet with reason: T322250
[01:27:00] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] Revert "phabricator: switch from phab1001 to phab1004, discovery and SPF" [dns] - 10https://gerrit.wikimedia.org/r/859077 (owner: 10Dzahn)
[01:27:21] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] Revert "hieradata: switch active Phabricator server to phab1004" [puppet] - 10https://gerrit.wikimedia.org/r/859078 (owner: 10Dzahn)
[01:28:52] <logmsgbot>	 !log brennen@deploy1002 Started deploy [phabricator/deployment@f68dc24]: deploy config changes for phab1004 -> phab1001 revert
[01:29:11] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to  and previous config saved to /var/cache/conftool/dbconfig/20221122-012910-ladsgroup.json
[01:29:49] <logmsgbot>	 !log brennen@deploy1002 Finished deploy [phabricator/deployment@f68dc24]: deploy config changes for phab1004 -> phab1001 revert (duration: 00m 56s)
[01:35:22] <icinga-wm>	 RECOVERY - PHD should be running on phab1001 is OK: PROCS OK: 1 process with regex args php ./phd-daemon, UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[01:37:45] <jinxer-wm>	 (JobUnavailable) firing: (5) Reduced availability for job redis_gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:42:45] <jinxer-wm>	 (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:44:17] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to  and previous config saved to /var/cache/conftool/dbconfig/20221122-014417-ladsgroup.json
[01:51:03] <mutante>	 we had to revert. for now phab1001 is the prod server again despite earlier comments
[01:52:45] <jinxer-wm>	 (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:55:54] <logmsgbot>	 !log dzahn@cumin2002 START - Cookbook sre.hosts.remove-downtime for phab1001.eqiad.wmnet
[01:55:54] <logmsgbot>	 !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for phab1001.eqiad.wmnet
[01:56:16] <logmsgbot>	 !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 14 days, 0:00:00 on phab1004.eqiad.wmnet with reason: T322250
[01:56:20] <logmsgbot>	 !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on phab1004.eqiad.wmnet with reason: T322250
[01:56:21] <stashbot>	 T322250: decom phab2001 (service owner) - https://phabricator.wikimedia.org/T322250
[01:59:23] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T323214)', diff saved to https://phabricator.wikimedia.org/P40412 and previous config saved to /var/cache/conftool/dbconfig/20221122-015923-ladsgroup.json
[01:59:25] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[01:59:29] <stashbot>	 T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214
[01:59:49] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[02:06:28] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T323214)', diff saved to https://phabricator.wikimedia.org/P40413 and previous config saved to /var/cache/conftool/dbconfig/20221122-020628-ladsgroup.json
[02:06:34] <stashbot>	 T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214
[02:07:45] <jinxer-wm>	 (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:17:45] <jinxer-wm>	 (JobUnavailable) resolved: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:21:35] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P40414 and previous config saved to /var/cache/conftool/dbconfig/20221122-022134-ladsgroup.json
[02:23:55] <jinxer-wm>	 (NodeTextfileStale) firing: (48) Stale textfile for cp1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[02:24:54] <jinxer-wm>	 (NodeTextfileStale) firing: (48) Stale textfile for cp1076:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[02:36:41] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P40415 and previous config saved to /var/cache/conftool/dbconfig/20221122-023641-ladsgroup.json
[02:51:16] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:51:48] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T323214)', diff saved to https://phabricator.wikimedia.org/P40416 and previous config saved to /var/cache/conftool/dbconfig/20221122-025148-ladsgroup.json
[02:51:50] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2171.codfw.wmnet with reason: Maintenance
[02:51:54] <stashbot>	 T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214
[02:52:03] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2171.codfw.wmnet with reason: Maintenance
[02:52:10] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2171:3315 (T323214)', diff saved to https://phabricator.wikimedia.org/P40417 and previous config saved to /var/cache/conftool/dbconfig/20221122-025209-ladsgroup.json
[02:52:24] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:55:12] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48976 bytes in 3.633 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:56:16] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.377 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:00:05] <jouncebot>	 Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221122T0300)
[03:49:49] <wikibugs>	 (03PS1) 10KartikMistry: Make Western Frisian Wikipedia Machine Translation stricter by 10% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/859161 (https://phabricator.wikimedia.org/T323415)
[04:00:04] <jouncebot>	 Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221122T0400)
[04:00:38] <icinga-wm>	 PROBLEM - Router interfaces on cr3-esams is CRITICAL: CRITICAL: host 91.198.174.245, interfaces up: 83, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:01:16] <icinga-wm>	 PROBLEM - Router interfaces on cr2-drmrs is CRITICAL: CRITICAL: host 185.15.58.129, interfaces up: 60, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:03:42] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1161.eqiad.wmnet with reason: Maintenance
[04:04:06] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1161.eqiad.wmnet with reason: Maintenance
[04:04:08] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[04:04:23] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[04:04:30] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1161 (T323214)', diff saved to https://phabricator.wikimedia.org/P40418 and previous config saved to /var/cache/conftool/dbconfig/20221122-040429-ladsgroup.json
[04:04:35] <stashbot>	 T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214
[04:54:06] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315 (T323214)', diff saved to https://phabricator.wikimedia.org/P40419 and previous config saved to /var/cache/conftool/dbconfig/20221122-045406-ladsgroup.json
[04:54:12] <stashbot>	 T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214
[05:07:04] <icinga-wm>	 PROBLEM - SSH on mw1327.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:08:40] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[05:09:13] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315', diff saved to https://phabricator.wikimedia.org/P40420 and previous config saved to /var/cache/conftool/dbconfig/20221122-050912-ladsgroup.json
[05:24:19] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315', diff saved to https://phabricator.wikimedia.org/P40421 and previous config saved to /var/cache/conftool/dbconfig/20221122-052419-ladsgroup.json
[05:25:14] <jinxer-wm>	 (KubernetesAPILatency) firing: (12) High Kubernetes API latency (LIST authorizationpolicies) on k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[05:39:26] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315 (T323214)', diff saved to https://phabricator.wikimedia.org/P40422 and previous config saved to /var/cache/conftool/dbconfig/20221122-053925-ladsgroup.json
[05:39:28] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2178.codfw.wmnet with reason: Maintenance
[05:39:32] <stashbot>	 T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214
[05:39:41] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2178.codfw.wmnet with reason: Maintenance
[05:39:47] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2178 (T323214)', diff saved to https://phabricator.wikimedia.org/P40423 and previous config saved to /var/cache/conftool/dbconfig/20221122-053947-ladsgroup.json
[06:03:15] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T323214)', diff saved to https://phabricator.wikimedia.org/P40424 and previous config saved to /var/cache/conftool/dbconfig/20221122-060315-ladsgroup.json
[06:03:21] <stashbot>	 T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214
[06:18:22] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P40425 and previous config saved to /var/cache/conftool/dbconfig/20221122-061821-ladsgroup.json
[06:23:54] <jinxer-wm>	 (NodeTextfileStale) firing: (48) Stale textfile for cp1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[06:24:54] <jinxer-wm>	 (NodeTextfileStale) firing: (48) Stale textfile for cp1076:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[06:26:10] <icinga-wm>	 PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:33:28] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P40426 and previous config saved to /var/cache/conftool/dbconfig/20221122-063328-ladsgroup.json
[06:44:46] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:44:59] <jinxer-wm>	 (KubernetesAPILatency) firing: (13) High Kubernetes API latency (LIST nodes) on aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[06:48:35] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T323214)', diff saved to https://phabricator.wikimedia.org/P40427 and previous config saved to /var/cache/conftool/dbconfig/20221122-064834-ladsgroup.json
[06:48:36] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1185.eqiad.wmnet with reason: Maintenance
[06:48:41] <stashbot>	 T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214
[06:48:50] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1185.eqiad.wmnet with reason: Maintenance
[06:48:56] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1185 (T323214)', diff saved to https://phabricator.wikimedia.org/P40428 and previous config saved to /var/cache/conftool/dbconfig/20221122-064856-ladsgroup.json
[06:50:24] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 27 hosts with reason: Primary switchover s2 T323116
[06:50:29] <stashbot>	 T323116: Switchover s2 master (db1122 -> db1162) - https://phabricator.wikimedia.org/T323116
[06:50:54] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 27 hosts with reason: Primary switchover s2 T323116
[06:52:20] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Set db1162 with weight 0 T323116', diff saved to https://phabricator.wikimedia.org/P40429 and previous config saved to /var/cache/conftool/dbconfig/20221122-065219-ladsgroup.json
[06:56:44] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:00:05] <jouncebot>	 kormat, marostegui, and Amir1: Your horoscope predicts another unfortunate Primary database switchover deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221122T0700).
[07:00:23] <Amir1>	 need a couple of minutes to finish the topology move
[07:08:35] <wikibugs>	 (03PS1) 10Gerrit maintenance bot: mariadb: Promote db1157 to s3 master [puppet] - 10https://gerrit.wikimedia.org/r/858380 (https://phabricator.wikimedia.org/T323546)
[07:08:38] <wikibugs>	 (03PS1) 10Gerrit maintenance bot: wmnet: Update s3-master alias [dns] - 10https://gerrit.wikimedia.org/r/858381 (https://phabricator.wikimedia.org/T323546)
[07:08:56] <icinga-wm>	 RECOVERY - SSH on mw1327.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:09:06] <wikibugs>	 (03PS1) 10Gerrit maintenance bot: mariadb: Promote db1118 to s1 master [puppet] - 10https://gerrit.wikimedia.org/r/858382 (https://phabricator.wikimedia.org/T323547)
[07:09:10] <wikibugs>	 (03PS1) 10Gerrit maintenance bot: wmnet: Update s1-master alias [dns] - 10https://gerrit.wikimedia.org/r/858383 (https://phabricator.wikimedia.org/T323547)
[07:12:33] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: scap: add mw on k8s dsh list [puppet] - 10https://gerrit.wikimedia.org/r/858988 (https://phabricator.wikimedia.org/T323349)
[07:12:46] <Amir1>	 done now
[07:13:40] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: blubberoid: convert to modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/858206
[07:13:42] <wikibugs>	 (03PS2) 10Ladsgroup: mariadb: Promote db1162 to s2 master [puppet] - 10https://gerrit.wikimedia.org/r/856496 (https://phabricator.wikimedia.org/T323116) (owner: 10Gerrit maintenance bot)
[07:13:45] <wikibugs>	 (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] mariadb: Promote db1162 to s2 master [puppet] - 10https://gerrit.wikimedia.org/r/856496 (https://phabricator.wikimedia.org/T323116) (owner: 10Gerrit maintenance bot)
[07:14:13] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38374/console" [puppet] - 10https://gerrit.wikimedia.org/r/858988 (https://phabricator.wikimedia.org/T323349) (owner: 10Giuseppe Lavagetto)
[07:14:43] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T323214)', diff saved to https://phabricator.wikimedia.org/P40430 and previous config saved to /var/cache/conftool/dbconfig/20221122-071442-ladsgroup.json
[07:14:49] <stashbot>	 T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214
[07:16:48] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:17:03] <Amir1>	 !log Starting s2 eqiad failover from db1122 to db1162 - T323116
[07:17:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:17:08] <stashbot>	 T323116: Switchover s2 master (db1122 -> db1162) - https://phabricator.wikimedia.org/T323116
[07:17:27] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Set s2 eqiad as read-only for maintenance - T323116', diff saved to https://phabricator.wikimedia.org/P40431 and previous config saved to /var/cache/conftool/dbconfig/20221122-071727-ladsgroup.json
[07:17:59] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Promote db1162 to s2 primary and set section read-write T323116', diff saved to https://phabricator.wikimedia.org/P40432 and previous config saved to /var/cache/conftool/dbconfig/20221122-071759-ladsgroup.json
[07:21:25] <wikibugs>	 (03PS2) 10Ladsgroup: wmnet: Update s2-master alias [dns] - 10https://gerrit.wikimedia.org/r/856497 (https://phabricator.wikimedia.org/T323116) (owner: 10Gerrit maintenance bot)
[07:21:34] <wikibugs>	 (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] wmnet: Update s2-master alias [dns] - 10https://gerrit.wikimedia.org/r/856497 (https://phabricator.wikimedia.org/T323116) (owner: 10Gerrit maintenance bot)
[07:22:24] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1096.eqiad.wmnet with reason: Maintenance
[07:22:27] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1096.eqiad.wmnet with reason: Maintenance
[07:22:33] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1096:3315 (T321126)', diff saved to https://phabricator.wikimedia.org/P40433 and previous config saved to /var/cache/conftool/dbconfig/20221122-072233-marostegui.json
[07:22:39] <stashbot>	 T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126
[07:23:51] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] blubberoid: convert to modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/858206 (owner: 10Giuseppe Lavagetto)
[07:25:05] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315 (T321126)', diff saved to https://phabricator.wikimedia.org/P40434 and previous config saved to /var/cache/conftool/dbconfig/20221122-072505-marostegui.json
[07:28:02] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depool db1122 T323116', diff saved to https://phabricator.wikimedia.org/P40435 and previous config saved to /var/cache/conftool/dbconfig/20221122-072802-ladsgroup.json
[07:28:08] <stashbot>	 T323116: Switchover s2 master (db1122 -> db1162) - https://phabricator.wikimedia.org/T323116
[07:28:20] <icinga-wm>	 RECOVERY - Router interfaces on cr3-esams is OK: OK: host 91.198.174.245, interfaces up: 84, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:28:59] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1099.eqiad.wmnet with reason: Maintenance
[07:29:00] <wikibugs>	 (03Merged) 10jenkins-bot: blubberoid: convert to modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/858206 (owner: 10Giuseppe Lavagetto)
[07:29:04] <icinga-wm>	 RECOVERY - Router interfaces on cr2-drmrs is OK: OK: host 185.15.58.129, interfaces up: 61, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:29:12] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1099.eqiad.wmnet with reason: Maintenance
[07:29:19] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1099:3311 (T321130)', diff saved to https://phabricator.wikimedia.org/P40436 and previous config saved to /var/cache/conftool/dbconfig/20221122-072918-marostegui.json
[07:29:24] <stashbot>	 T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130
[07:29:30] <wikibugs>	 10SRE, 10ops-codfw, 10DBA: db2174 lost power - https://phabricator.wikimedia.org/T323512 (10Marostegui) >>! In T323512#8410369, @jcrespo wrote: > @Papaul, I wonder if we could do a "simple" test of checking the power supply redundancy by "pulling the plug" (literally or just pushing the on/off button) to che...
[07:29:49] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P40437 and previous config saved to /var/cache/conftool/dbconfig/20221122-072949-ladsgroup.json
[07:30:03] <logmsgbot>	 !log oblivian@deploy1002 helmfile [staging] START helmfile.d/services/blubberoid: apply
[07:32:26] <wikibugs>	 (03PS1) 10Marostegui: db2174: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/859328 (https://phabricator.wikimedia.org/T323512)
[07:33:03] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] db2174: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/859328 (https://phabricator.wikimedia.org/T323512) (owner: 10Marostegui)
[07:33:51] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1122.eqiad.wmnet with reason: Maintenance
[07:33:54] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1122.eqiad.wmnet with reason: Maintenance
[07:39:47] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1122.eqiad.wmnet with reason: Maintenance
[07:39:50] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1122.eqiad.wmnet with reason: Maintenance
[07:40:06] <logmsgbot>	 !log oblivian@deploy1002 helmfile [staging] DONE helmfile.d/services/blubberoid: apply
[07:40:09] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dse-k8s-etcd1003.eqiad.wmnet with reason: rack move of ganeti1012
[07:40:12] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315', diff saved to https://phabricator.wikimedia.org/P40438 and previous config saved to /var/cache/conftool/dbconfig/20221122-074011-marostegui.json
[07:40:24] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dse-k8s-etcd1003.eqiad.wmnet with reason: rack move of ganeti1012
[07:40:35] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on kubetcd1004.eqiad.wmnet with reason: rack move of ganeti1012
[07:40:51] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on kubetcd1004.eqiad.wmnet with reason: rack move of ganeti1012
[07:41:00] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on ml-etcd1002.eqiad.wmnet with reason: rack move of ganeti1012
[07:41:16] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ml-etcd1002.eqiad.wmnet with reason: rack move of ganeti1012
[07:42:22] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA: eqiad: move non WMCS servers out of rack C8 - https://phabricator.wikimedia.org/T308339 (10MoritzMuehlenhoff) ganeti1012 can be powered down for the rack move; the remaining three VMs are redundant and have been silenced in monitoring.
[07:43:23] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311 (T321130)', diff saved to https://phabricator.wikimedia.org/P40439 and previous config saved to /var/cache/conftool/dbconfig/20221122-074323-marostegui.json
[07:43:29] <stashbot>	 T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130
[07:44:01] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T323214)', diff saved to https://phabricator.wikimedia.org/P40440 and previous config saved to /var/cache/conftool/dbconfig/20221122-074400-ladsgroup.json
[07:44:06] <stashbot>	 T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214
[07:44:35] <logmsgbot>	 !log oblivian@deploy1002 helmfile [staging] START helmfile.d/services/blubberoid: apply
[07:44:56] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P40441 and previous config saved to /var/cache/conftool/dbconfig/20221122-074455-ladsgroup.json
[07:49:48] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: deployment_server::k8s: add new data structure for modules [puppet] - 10https://gerrit.wikimedia.org/r/859430
[07:50:03] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] deployment_server::k8s: add new data structure for modules [puppet] - 10https://gerrit.wikimedia.org/r/859430 (owner: 10Giuseppe Lavagetto)
[07:51:30] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Retire two k8s Partman recipes [puppet] - 10https://gerrit.wikimedia.org/r/858587 (https://phabricator.wikimedia.org/T156955) (owner: 10Muehlenhoff)
[07:52:18] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Remove dumpsdata100XH750.cfg partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/858589 (owner: 10Muehlenhoff)
[07:54:39] <logmsgbot>	 !log oblivian@deploy1002 helmfile [staging] DONE helmfile.d/services/blubberoid: apply
[07:55:18] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315', diff saved to https://phabricator.wikimedia.org/P40442 and previous config saved to /var/cache/conftool/dbconfig/20221122-075518-marostegui.json
[07:56:41] <wikibugs>	 10SRE, 10observability, 10serviceops, 10Patch-For-Review, and 2 others: Create an alert for high memcached bw usage - https://phabricator.wikimedia.org/T224454 (10elukey) An optional (but in my opinion useful) alert could be related to a prolonged usage of the gutter pool, that is not something we wish for...
[07:57:43] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to ciadmin for Dom Walden - https://phabricator.wikimedia.org/T323549 (10dom_walden)
[07:58:20] <wikibugs>	 (03PS1) 10Muehlenhoff: Retire obsolete cloudvirt Partman recipes [puppet] - 10https://gerrit.wikimedia.org/r/859431 (https://phabricator.wikimedia.org/T156955)
[07:58:30] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', diff saved to https://phabricator.wikimedia.org/P40443 and previous config saved to /var/cache/conftool/dbconfig/20221122-075829-marostegui.json
[07:58:34] <logmsgbot>	 !log oblivian@deploy1002 helmfile [staging] START helmfile.d/services/blubberoid: apply
[07:58:44] <logmsgbot>	 !log oblivian@deploy1002 helmfile [staging] DONE helmfile.d/services/blubberoid: apply
[07:59:07] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P40444 and previous config saved to /var/cache/conftool/dbconfig/20221122-075907-ladsgroup.json
[07:59:24] <logmsgbot>	 !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/blubberoid: apply
[08:00:02] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T323214)', diff saved to https://phabricator.wikimedia.org/P40445 and previous config saved to /var/cache/conftool/dbconfig/20221122-080002-ladsgroup.json
[08:00:05] <jouncebot>	 Amir1 and Urbanecm: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221122T0800).
[08:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[08:00:08] <stashbot>	 T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214
[08:00:38] <logmsgbot>	 !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/blubberoid: apply
[08:08:30] <logmsgbot>	 !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/blubberoid: apply
[08:09:51] <logmsgbot>	 !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/blubberoid: apply
[08:10:10] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance
[08:10:23] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance
[08:10:25] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315 (T321126)', diff saved to https://phabricator.wikimedia.org/P40446 and previous config saved to /var/cache/conftool/dbconfig/20221122-081024-marostegui.json
[08:10:26] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1100.eqiad.wmnet with reason: Maintenance
[08:10:29] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1100.eqiad.wmnet with reason: Maintenance
[08:10:30] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3317 (T322618)', diff saved to https://phabricator.wikimedia.org/P40447 and previous config saved to /var/cache/conftool/dbconfig/20221122-081029-ladsgroup.json
[08:10:32] <stashbot>	 T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126
[08:10:35] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1100 (T321126)', diff saved to https://phabricator.wikimedia.org/P40448 and previous config saved to /var/cache/conftool/dbconfig/20221122-081035-marostegui.json
[08:10:45] <stashbot>	 T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618
[08:12:39] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "\o/" [puppet] - 10https://gerrit.wikimedia.org/r/859431 (https://phabricator.wikimedia.org/T156955) (owner: 10Muehlenhoff)
[08:12:40] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T322618)', diff saved to https://phabricator.wikimedia.org/P40449 and previous config saved to /var/cache/conftool/dbconfig/20221122-081239-ladsgroup.json
[08:13:07] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1100 (T321126)', diff saved to https://phabricator.wikimedia.org/P40450 and previous config saved to /var/cache/conftool/dbconfig/20221122-081307-marostegui.json
[08:13:36] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', diff saved to https://phabricator.wikimedia.org/P40451 and previous config saved to /var/cache/conftool/dbconfig/20221122-081336-marostegui.json
[08:14:14] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P40452 and previous config saved to /var/cache/conftool/dbconfig/20221122-081413-ladsgroup.json
[08:15:54] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] node: Exclude trafficserver promfile mtime check [alerts] - 10https://gerrit.wikimedia.org/r/858658 (owner: 10BCornwall)
[08:19:45] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2098.codfw.wmnet with reason: Maintenance
[08:19:58] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2098.codfw.wmnet with reason: Maintenance
[08:20:09] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2100.codfw.wmnet with reason: Maintenance
[08:20:23] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2100.codfw.wmnet with reason: Maintenance
[08:20:37] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2108.codfw.wmnet with reason: Maintenance
[08:20:51] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2108.codfw.wmnet with reason: Maintenance
[08:20:57] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2108 (T322618)', diff saved to https://phabricator.wikimedia.org/P40453 and previous config saved to /var/cache/conftool/dbconfig/20221122-082057-ladsgroup.json
[08:21:02] <stashbot>	 T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618
[08:23:14] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108 (T322618)', diff saved to https://phabricator.wikimedia.org/P40454 and previous config saved to /var/cache/conftool/dbconfig/20221122-082314-ladsgroup.json
[08:27:46] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P40455 and previous config saved to /var/cache/conftool/dbconfig/20221122-082746-ladsgroup.json
[08:28:14] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1100', diff saved to https://phabricator.wikimedia.org/P40456 and previous config saved to /var/cache/conftool/dbconfig/20221122-082813-marostegui.json
[08:28:43] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311 (T321130)', diff saved to https://phabricator.wikimedia.org/P40457 and previous config saved to /var/cache/conftool/dbconfig/20221122-082842-marostegui.json
[08:28:44] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1105.eqiad.wmnet with reason: Maintenance
[08:28:48] <stashbot>	 T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130
[08:28:58] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1105.eqiad.wmnet with reason: Maintenance
[08:29:04] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3311 (T321130)', diff saved to https://phabricator.wikimedia.org/P40458 and previous config saved to /var/cache/conftool/dbconfig/20221122-082904-marostegui.json
[08:29:20] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T323214)', diff saved to https://phabricator.wikimedia.org/P40459 and previous config saved to /var/cache/conftool/dbconfig/20221122-082920-ladsgroup.json
[08:29:22] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1200.eqiad.wmnet with reason: Maintenance
[08:29:25] <stashbot>	 T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214
[08:29:57] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1200.eqiad.wmnet with reason: Maintenance
[08:30:03] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1200 (T323214)', diff saved to https://phabricator.wikimedia.org/P40460 and previous config saved to /var/cache/conftool/dbconfig/20221122-083003-ladsgroup.json
[08:38:21] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108', diff saved to https://phabricator.wikimedia.org/P40461 and previous config saved to /var/cache/conftool/dbconfig/20221122-083820-ladsgroup.json
[08:42:53] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P40462 and previous config saved to /var/cache/conftool/dbconfig/20221122-084252-ladsgroup.json
[08:43:20] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1100', diff saved to https://phabricator.wikimedia.org/P40463 and previous config saved to /var/cache/conftool/dbconfig/20221122-084320-marostegui.json
[08:43:26] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311 (T321130)', diff saved to https://phabricator.wikimedia.org/P40464 and previous config saved to /var/cache/conftool/dbconfig/20221122-084326-marostegui.json
[08:43:32] <stashbot>	 T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130
[08:53:27] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108', diff saved to https://phabricator.wikimedia.org/P40465 and previous config saved to /var/cache/conftool/dbconfig/20221122-085327-ladsgroup.json
[08:57:59] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T322618)', diff saved to https://phabricator.wikimedia.org/P40466 and previous config saved to /var/cache/conftool/dbconfig/20221122-085758-ladsgroup.json
[08:58:00] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1101.eqiad.wmnet with reason: Maintenance
[08:58:05] <stashbot>	 T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618
[08:58:14] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1101.eqiad.wmnet with reason: Maintenance
[08:58:21] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1101:3317 (T322618)', diff saved to https://phabricator.wikimedia.org/P40467 and previous config saved to /var/cache/conftool/dbconfig/20221122-085820-ladsgroup.json
[08:58:27] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1100 (T321126)', diff saved to https://phabricator.wikimedia.org/P40468 and previous config saved to /var/cache/conftool/dbconfig/20221122-085826-marostegui.json
[08:58:28] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1110.eqiad.wmnet with reason: Maintenance
[08:58:31] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1110.eqiad.wmnet with reason: Maintenance
[08:58:32] <stashbot>	 T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126
[08:58:33] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311', diff saved to https://phabricator.wikimedia.org/P40469 and previous config saved to /var/cache/conftool/dbconfig/20221122-085832-marostegui.json
[08:58:44] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1110 (T321126)', diff saved to https://phabricator.wikimedia.org/P40470 and previous config saved to /var/cache/conftool/dbconfig/20221122-085843-marostegui.json
[08:59:35] <godog>	 jouncebot: next
[08:59:35] <jouncebot>	 In 5 hour(s) and 0 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221122T1400)
[08:59:35] <jouncebot>	 In 5 hour(s) and 0 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221122T1400)
[08:59:45] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] graphite: start mirroring traffic to graphite2004 [puppet] - 10https://gerrit.wikimedia.org/r/858610 (https://phabricator.wikimedia.org/T315524) (owner: 10Filippo Giunchedi)
[09:00:24] <logmsgbot>	 !log jmm@cumin1001 START - Cookbook sre.hosts.reboot-single for host cumin2002.codfw.wmnet
[09:00:31] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T322618)', diff saved to https://phabricator.wikimedia.org/P40471 and previous config saved to /var/cache/conftool/dbconfig/20221122-090030-ladsgroup.json
[09:01:15] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T321126)', diff saved to https://phabricator.wikimedia.org/P40472 and previous config saved to /var/cache/conftool/dbconfig/20221122-090115-marostegui.json
[09:07:28] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] apertium: convert to modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/856945 (owner: 10Giuseppe Lavagetto)
[09:08:33] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108 (T322618)', diff saved to https://phabricator.wikimedia.org/P40473 and previous config saved to /var/cache/conftool/dbconfig/20221122-090833-ladsgroup.json
[09:08:35] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2120.codfw.wmnet with reason: Maintenance
[09:08:39] <stashbot>	 T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618
[09:08:40] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[09:08:48] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2120.codfw.wmnet with reason: Maintenance
[09:08:55] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2120 (T322618)', diff saved to https://phabricator.wikimedia.org/P40474 and previous config saved to /var/cache/conftool/dbconfig/20221122-090854-ladsgroup.json
[09:10:49] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Implement email address validation workflow - https://phabricator.wikimedia.org/T320808 (10SLyngshede-WMF) 05Open→03In progress
[09:10:51] <wikibugs>	 10SRE, 10Infrastructure-Foundations: IDM milestone 2 "Initial limited deployment" - https://phabricator.wikimedia.org/T320603 (10SLyngshede-WMF)
[09:11:02] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:11:12] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120 (T322618)', diff saved to https://phabricator.wikimedia.org/P40475 and previous config saved to /var/cache/conftool/dbconfig/20221122-091112-ladsgroup.json
[09:11:31] <logmsgbot>	 !log jmm@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host cumin2002.codfw.wmnet
[09:12:28] <wikibugs>	 (03Merged) 10jenkins-bot: apertium: convert to modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/856945 (owner: 10Giuseppe Lavagetto)
[09:12:41] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudvirt1050: move to modern NIC setup [puppet] - 10https://gerrit.wikimedia.org/r/859095 (https://phabricator.wikimedia.org/T319184) (owner: 10Arturo Borrero Gonzalez)
[09:13:02] <icinga-wm>	 PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:13:15] <wikibugs>	 10SRE, 10ops-codfw, 10DBA: db2174 lost power - https://phabricator.wikimedia.org/T323512 (10jcrespo) a:03Papaul
[09:13:20] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1050.eqiad.wmnet with OS bullseye
[09:13:30] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review, 10cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin1001 for host cloudvirt1050.eqiad.wmnet with O...
[09:13:39] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311', diff saved to https://phabricator.wikimedia.org/P40476 and previous config saved to /var/cache/conftool/dbconfig/20221122-091339-marostegui.json
[09:15:33] <jinxer-wm>	 (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[09:15:37] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P40477 and previous config saved to /var/cache/conftool/dbconfig/20221122-091537-ladsgroup.json
[09:16:22] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P40478 and previous config saved to /var/cache/conftool/dbconfig/20221122-091621-marostegui.json
[09:16:25] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: cloudvirt1049: move to modern NIC setup [puppet] - 10https://gerrit.wikimedia.org/r/859435 (https://phabricator.wikimedia.org/T319184)
[09:16:56] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] node: Exclude trafficserver promfile mtime check (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/858658 (owner: 10BCornwall)
[09:17:00] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review, 10cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (10aborrero)
[09:18:09] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM. You may want to collect +1 from Andrew as well to be on the safe side." [puppet] - 10https://gerrit.wikimedia.org/r/859431 (https://phabricator.wikimedia.org/T156955) (owner: 10Muehlenhoff)
[09:18:45] <logmsgbot>	 !log oblivian@deploy1002 helmfile [staging] START helmfile.d/services/apertium: apply
[09:19:51] <logmsgbot>	 !log oblivian@deploy1002 helmfile [staging] DONE helmfile.d/services/apertium: apply
[09:20:00] <icinga-wm>	 PROBLEM - Check unit status of httpbb_hourly_appserver on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[09:20:33] <jinxer-wm>	 (KeyholderUnarmed) resolved: 1 unarmed Keyholder key(s) on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[09:21:04] <icinga-wm>	 PROBLEM - SSH on mw1329.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:22:11] <logmsgbot>	 !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/apertium: apply
[09:22:22] <logmsgbot>	 !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/apertium: apply
[09:23:50] <logmsgbot>	 !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/apertium: apply
[09:24:44] <logmsgbot>	 !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/apertium: apply
[09:25:16] <moritzm>	 !log failover Ganeti master in eqiad to ganeti1028 T311687
[09:25:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:25:21] <stashbot>	 T311687: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687
[09:26:19] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120', diff saved to https://phabricator.wikimedia.org/P40479 and previous config saved to /var/cache/conftool/dbconfig/20221122-092618-ladsgroup.json
[09:27:28] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1050.eqiad.wmnet with reason: host reimage
[09:28:46] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311 (T321130)', diff saved to https://phabricator.wikimedia.org/P40480 and previous config saved to /var/cache/conftool/dbconfig/20221122-092845-marostegui.json
[09:28:51] <stashbot>	 T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130
[09:30:18] <icinga-wm>	 PROBLEM - ganeti-wconfd running on ganeti1027 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti
[09:30:44] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P40481 and previous config saved to /var/cache/conftool/dbconfig/20221122-093044-ladsgroup.json
[09:31:29] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P40482 and previous config saved to /var/cache/conftool/dbconfig/20221122-093128-marostegui.json
[09:31:56] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1050.eqiad.wmnet with reason: host reimage
[09:33:41] <icinga-wm>	 PROBLEM - graphite.wikimedia.org requires authentication on graphite2004 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 200 OK https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[09:34:22] <vgutierrez>	 uh ^?
[09:34:47] <jynus>	 it seems it is not routable from public network, but still not great
[09:35:09] <vgutierrez>	 well.. graphite.wikimedia.org is a public endpoint
[09:35:20] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1106.eqiad.wmnet with reason: Maintenance
[09:35:33] <jynus>	 yeah, but the public endpoint doesn't point there I think
[09:35:33] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1106.eqiad.wmnet with reason: Maintenance
[09:35:35] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[09:35:46] <jynus>	 I am checking recent commits
[09:35:49] <godog>	 that's me, apologies for the spam
[09:35:50] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[09:35:52] <godog>	 graphite2004 is a new host
[09:35:57] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1106 (T321130)', diff saved to https://phabricator.wikimedia.org/P40483 and previous config saved to /var/cache/conftool/dbconfig/20221122-093556-marostegui.json
[09:35:58] <godog>	 I'll silence it
[09:36:00] <vgutierrez>	 godog: ack
[09:36:02] <stashbot>	 T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130
[09:36:36] <logmsgbot>	 !log filippo@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on graphite2004.codfw.wmnet with reason: setup
[09:36:42] <godog>	 done ^
[09:36:50] <logmsgbot>	 !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on graphite2004.codfw.wmnet with reason: setup
[09:36:50] <jynus>	 vgutierrez: what I meant is apache was configured for the public endpoint but it was not reachable through it
[09:38:53] <icinga-wm>	 PROBLEM - Ganeti memory on ganeti1015 is CRITICAL: CRIT Memory 95% used. Largest process: qemu-system-x86 (4131886) = 12.7% https://wikitech.wikimedia.org/wiki/Ganeti%23Memory_pressure
[09:40:08] <jynus>	 ^ moritzm there was an increase in memory utilization, if you are still reimaging those they may need a rebalance afterwards
[09:40:29] <jynus>	 https://grafana.wikimedia.org/goto/w8X-ZbO4k?orgId=1
[09:40:44] <moritzm>	 yeah, that's known, I'm currently reshuffling VMs for reboots
[09:40:55] <jynus>	 no worries then
[09:41:25] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120', diff saved to https://phabricator.wikimedia.org/P40484 and previous config saved to /var/cache/conftool/dbconfig/20221122-094125-ladsgroup.json
[09:41:45] <jynus>	 hopefully we can move mailman to a dedicated host to free some resources there soon
[09:44:21] <moritzm>	 in general we have enough headroom, it's just temporary spikes during reimages/reboots since the cluster isn't rebalanced after every reboot/reimage, but rather when the entire work is completed
[09:45:51] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T322618)', diff saved to https://phabricator.wikimedia.org/P40485 and previous config saved to /var/cache/conftool/dbconfig/20221122-094550-ladsgroup.json
[09:45:52] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1127.eqiad.wmnet with reason: Maintenance
[09:45:56] <stashbot>	 T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618
[09:46:05] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1127.eqiad.wmnet with reason: Maintenance
[09:46:12] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1127 (T322618)', diff saved to https://phabricator.wikimedia.org/P40486 and previous config saved to /var/cache/conftool/dbconfig/20221122-094611-ladsgroup.json
[09:46:35] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T321126)', diff saved to https://phabricator.wikimedia.org/P40487 and previous config saved to /var/cache/conftool/dbconfig/20221122-094635-marostegui.json
[09:46:37] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1113.eqiad.wmnet with reason: Maintenance
[09:46:39] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1113.eqiad.wmnet with reason: Maintenance
[09:46:40] <stashbot>	 T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126
[09:46:45] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3315 (T321126)', diff saved to https://phabricator.wikimedia.org/P40488 and previous config saved to /var/cache/conftool/dbconfig/20221122-094645-marostegui.json
[09:47:05] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1017.eqiad.wmnet
[09:47:44] <wikibugs>	 SRE, ops-codfw, DBA: db2173 HW errors - https://phabricator.wikimedia.org/T322988 (Marostegui)
[09:48:17] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T321126)', diff saved to https://phabricator.wikimedia.org/P40489 and previous config saved to /var/cache/conftool/dbconfig/20221122-094817-marostegui.json
[09:48:22] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T322618)', diff saved to https://phabricator.wikimedia.org/P40490 and previous config saved to /var/cache/conftool/dbconfig/20221122-094821-ladsgroup.json
[09:48:39] <wikibugs>	 SRE, ops-codfw, DBA: db2173 HW errors - https://phabricator.wikimedia.org/T322988 (Marostegui) Once this host is back we need to make sure we apply {T321130} (enwiki)
[09:48:46] <logmsgbot>	 !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/apertium: apply
[09:49:36] <logmsgbot>	 !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/apertium: apply
[09:50:04] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T321130)', diff saved to https://phabricator.wikimedia.org/P40491 and previous config saved to /var/cache/conftool/dbconfig/20221122-095003-marostegui.json
[09:50:09] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T323214)', diff saved to https://phabricator.wikimedia.org/P40492 and previous config saved to /var/cache/conftool/dbconfig/20221122-095008-ladsgroup.json
[09:50:09] <stashbot>	 T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130
[09:50:14] <stashbot>	 T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214
[09:51:20] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1017.eqiad.wmnet
[09:56:19] <wikibugs>	 (CR) Jcrespo: "Answer:" [software/transferpy] - https://gerrit.wikimedia.org/r/859047 (https://phabricator.wikimedia.org/T323485) (owner: Jcrespo)
[09:56:31] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120 (T322618)', diff saved to https://phabricator.wikimedia.org/P40493 and previous config saved to /var/cache/conftool/dbconfig/20221122-095631-ladsgroup.json
[09:56:33] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2121.codfw.wmnet with reason: Maintenance
[09:56:38] <stashbot>	 T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618
[09:56:46] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2121.codfw.wmnet with reason: Maintenance
[09:56:54] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2121 (T322618)', diff saved to https://phabricator.wikimedia.org/P40494 and previous config saved to /var/cache/conftool/dbconfig/20221122-095652-ladsgroup.json
[09:58:22] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1050.eqiad.wmnet with OS bullseye
[09:58:32] <wikibugs>	 SRE, Infrastructure-Foundations, netops, Patch-For-Review, cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin1001 for host cloudvirt1050.eqiad.wmnet with OS bu...
[09:59:11] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121 (T322618)', diff saved to https://phabricator.wikimedia.org/P40495 and previous config saved to /var/cache/conftool/dbconfig/20221122-095910-ladsgroup.json
[10:01:50] <wikibugs>	 (PS1) Jcrespo: Update changelog for release 1.1 [software/transferpy] - https://gerrit.wikimedia.org/r/859446 (https://phabricator.wikimedia.org/T323485)
[10:02:41] <wikibugs>	 (PS1) Hashar: Add CI results to a tab [software/gerrit] (deploy/wmf/stable-3.5) - https://gerrit.wikimedia.org/r/859083
[10:02:57] <wikibugs>	 (Abandoned) Hashar: Add CI results to a tab [software/gerrit] (deploy/wmf/stable-3.4) - https://gerrit.wikimedia.org/r/856182 (owner: Hashar)
[10:03:01] <icinga-wm>	 RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:03:05] <wikibugs>	 (CR) CI reject: [V: -1] Add CI results to a tab [software/gerrit] (deploy/wmf/stable-3.5) - https://gerrit.wikimedia.org/r/859083 (owner: Hashar)
[10:03:23] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P40496 and previous config saved to /var/cache/conftool/dbconfig/20221122-100323-marostegui.json
[10:03:28] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P40497 and previous config saved to /var/cache/conftool/dbconfig/20221122-100328-ladsgroup.json
[10:05:10] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P40498 and previous config saved to /var/cache/conftool/dbconfig/20221122-100509-marostegui.json
[10:05:15] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P40499 and previous config saved to /var/cache/conftool/dbconfig/20221122-100515-ladsgroup.json
[10:06:49] <wikibugs>	 (CR) Giuseppe Lavagetto: [V: +1 C: +2] scap: add mw on k8s dsh list [puppet] - https://gerrit.wikimedia.org/r/858988 (https://phabricator.wikimedia.org/T323349) (owner: Giuseppe Lavagetto)
[10:09:56] <godog>	 !log start backfilling data into graphite2004 - T315524
[10:10:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:10:02] <stashbot>	 T315524: Put graphite2004 in service - https://phabricator.wikimedia.org/T315524
[10:10:22] <wikibugs>	 SRE, Traffic-Icebox: Let ats-tls handle port 80 - https://phabricator.wikimedia.org/T254235 (Vgutierrez) Open→Invalid ats-tls has been deprecated in favor of HAProxy
[10:12:15] <icinga-wm>	 RECOVERY - Check unit status of httpbb_hourly_appserver on cumin2002 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[10:12:19] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on aux-k8s-etcd1002.eqiad.wmnet with reason: ganeti reboot
[10:12:35] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on aux-k8s-etcd1002.eqiad.wmnet with reason: ganeti reboot
[10:13:28] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1026.eqiad.wmnet
[10:13:38] <wikibugs>	 (PS3) Giuseppe Lavagetto: api-gateway: convert to modules [deployment-charts] - https://gerrit.wikimedia.org/r/856950
[10:14:17] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121', diff saved to https://phabricator.wikimedia.org/P40500 and previous config saved to /var/cache/conftool/dbconfig/20221122-101417-ladsgroup.json
[10:15:55] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance
[10:16:09] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance
[10:16:10] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[10:16:15] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[10:16:21] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1167 (T321312)', diff saved to https://phabricator.wikimedia.org/P40501 and previous config saved to /var/cache/conftool/dbconfig/20221122-101620-ladsgroup.json
[10:18:30] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P40502 and previous config saved to /var/cache/conftool/dbconfig/20221122-101829-marostegui.json
[10:18:35] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P40503 and previous config saved to /var/cache/conftool/dbconfig/20221122-101834-ladsgroup.json
[10:18:51] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1026.eqiad.wmnet
[10:20:16] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P40504 and previous config saved to /var/cache/conftool/dbconfig/20221122-102016-marostegui.json
[10:20:21] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P40505 and previous config saved to /var/cache/conftool/dbconfig/20221122-102021-ladsgroup.json
[10:21:39] <icinga-wm>	 RECOVERY - SSH on mw1329.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:23:55] <jinxer-wm>	 (NodeTextfileStale) firing: (48) Stale textfile for cp1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[10:24:55] <jinxer-wm>	 (NodeTextfileStale) firing: (48) Stale textfile for cp1076:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[10:25:28] <wikibugs>	 (CR) Volans: [C: +1] "LGTM" [software/homer/deploy] - https://gerrit.wikimedia.org/r/769729 (https://phabricator.wikimedia.org/T303529) (owner: Cathal Mooney)
[10:25:49] <wikibugs>	 (PS11) Cathal Mooney: Modify wmf-netbox plugin to provide QFX5120-48Y port block speeds [software/homer/deploy] - https://gerrit.wikimedia.org/r/769729 (https://phabricator.wikimedia.org/T303529)
[10:26:13] <wikibugs>	 (CR) Cathal Mooney: [C: +2] Modify wmf-netbox plugin to provide QFX5120-48Y port block speeds [software/homer/deploy] - https://gerrit.wikimedia.org/r/769729 (https://phabricator.wikimedia.org/T303529) (owner: Cathal Mooney)
[10:27:24] <wikibugs>	 (CR) Cathal Mooney: [V: +2 C: +2] Modify wmf-netbox plugin to provide QFX5120-48Y port block speeds [software/homer/deploy] - https://gerrit.wikimedia.org/r/769729 (https://phabricator.wikimedia.org/T303529) (owner: Cathal Mooney)
[10:28:09] <logmsgbot>	 !log jnuche@deploy1002 Started scap: testing k8s deploys
[10:29:24] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121', diff saved to https://phabricator.wikimedia.org/P40506 and previous config saved to /var/cache/conftool/dbconfig/20221122-102923-ladsgroup.json
[10:29:54] <wikibugs>	 (CR) Btullis: Add a spark-operator chart and helmfile configuraiton (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/855674 (https://phabricator.wikimedia.org/T318926) (owner: Btullis)
[10:30:38] <wikibugs>	 SRE-swift-storage, Infrastructure-Foundations, Patch-For-Review: unstable device mapping of SSDs causing installer problems - example reimage with destruction of swift filesystem - https://phabricator.wikimedia.org/T308677 (jbond) >>! In T308677#8290550, @jbond wrote: > just putting a note here.  aft...
[10:31:13] <wikibugs>	 (CR) Muehlenhoff: [C: +2] alertmanager: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/858603 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[10:31:52] <wikibugs>	 (PS1) Vgutierrez: orchestrator: Reject non-tls requests with a 403 [puppet] - https://gerrit.wikimedia.org/r/859449 (https://phabricator.wikimedia.org/T238720)
[10:33:36] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T321126)', diff saved to https://phabricator.wikimedia.org/P40507 and previous config saved to /var/cache/conftool/dbconfig/20221122-103336-marostegui.json
[10:33:38] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1144.eqiad.wmnet with reason: Maintenance
[10:33:40] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1144.eqiad.wmnet with reason: Maintenance
[10:33:41] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T322618)', diff saved to https://phabricator.wikimedia.org/P40508 and previous config saved to /var/cache/conftool/dbconfig/20221122-103341-ladsgroup.json
[10:33:42] <stashbot>	 T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126
[10:33:43] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1136.eqiad.wmnet with reason: Maintenance
[10:33:47] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3315 (T321126)', diff saved to https://phabricator.wikimedia.org/P40509 and previous config saved to /var/cache/conftool/dbconfig/20221122-103346-marostegui.json
[10:33:55] <stashbot>	 T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618
[10:33:56] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1136.eqiad.wmnet with reason: Maintenance
[10:34:03] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1136 (T322618)', diff saved to https://phabricator.wikimedia.org/P40510 and previous config saved to /var/cache/conftool/dbconfig/20221122-103402-ladsgroup.json
[10:34:19] <wikibugs>	 (CR) Vgutierrez: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38375/console" [puppet] - https://gerrit.wikimedia.org/r/859449 (https://phabricator.wikimedia.org/T238720) (owner: Vgutierrez)
[10:34:48] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: openstack: cloudvirt: unset_maintenance: clarify SAL message [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/859451
[10:34:49] <logmsgbot>	 !log jnuche@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[10:34:50] <logmsgbot>	 !log jnuche@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[10:35:23] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T321130)', diff saved to https://phabricator.wikimedia.org/P40511 and previous config saved to /var/cache/conftool/dbconfig/20221122-103522-marostegui.json
[10:35:24] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1107.eqiad.wmnet with reason: Maintenance
[10:35:28] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T323214)', diff saved to https://phabricator.wikimedia.org/P40512 and previous config saved to /var/cache/conftool/dbconfig/20221122-103527-ladsgroup.json
[10:35:28] <stashbot>	 T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130
[10:35:29] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[10:35:35] <_joe_>	 claime: uhm, did we forget to merge the change to helmfile.yaml?
[10:35:37] <stashbot>	 T323214: Fix unsigned drifts in flaggedrevs caused by 4c0b3c7b9b0 - https://phabricator.wikimedia.org/T323214
[10:35:38] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1107.eqiad.wmnet with reason: Maintenance
[10:35:43] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[10:35:44] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1107 (T321130)', diff saved to https://phabricator.wikimedia.org/P40513 and previous config saved to /var/cache/conftool/dbconfig/20221122-103544-marostegui.json
[10:35:55] <claime>	 _joe_: which one?
[10:36:10] <claime>	 The do not log?
[10:36:12] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136 (T322618)', diff saved to https://phabricator.wikimedia.org/P40514 and previous config saved to /var/cache/conftool/dbconfig/20221122-103612-ladsgroup.json
[10:36:15] <claime>	 It should have been
[10:36:15] <_joe_>	 claime: yes
[10:36:19] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T321126)', diff saved to https://phabricator.wikimedia.org/P40515 and previous config saved to /var/cache/conftool/dbconfig/20221122-103618-marostegui.json
[10:37:03] <claime>	 https://gitlab.wikimedia.org/repos/releng/scap/-/commit/716d9b6cde07d14381c305cfaef9876bdf10ab5b
[10:37:36] <claime>	 _joe_: ^
[10:38:23] <_joe_>	 claime: yeah but you also needed to change all the helmfiles right
[10:38:29] <_joe_>	 else they'd run the hooks
[10:38:51] <claime>	 _joe_: No, calling them with the environment variable set was enough when I tested manually
[10:38:58] <logmsgbot>	 !log jnuche@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[10:39:09] <logmsgbot>	 !log jnuche@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[10:40:25] <claime>	 _joe_: Like `SUPPRESS_SAL=true helmfile -e eqiad -i apply` worked, so there may be something I'm missing
[10:40:40] <wikibugs>	 (CR) Vgutierrez: [C: +1] confd: create /var/run/confd-template [puppet] - https://gerrit.wikimedia.org/r/859102 (https://phabricator.wikimedia.org/T321678) (owner: Filippo Giunchedi)
[10:43:28] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] confd: create /var/run/confd-template [puppet] - https://gerrit.wikimedia.org/r/859102 (https://phabricator.wikimedia.org/T321678) (owner: Filippo Giunchedi)
[10:43:31] <logmsgbot>	 !log jnuche@deploy1002 helmfile [codfw] START helmfile.d/services/mw-web: apply
[10:43:31] <logmsgbot>	 !log jnuche@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply
[10:43:31] <logmsgbot>	 !log jnuche@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply
[10:43:31] <logmsgbot>	 !log jnuche@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply
[10:43:31] <logmsgbot>	 !log jnuche@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply
[10:43:31] <logmsgbot>	 !log jnuche@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply
[10:43:31] <logmsgbot>	 !log jnuche@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-web: apply
[10:43:32] <logmsgbot>	 !log jnuche@deploy1002 helmfile [codfw] START helmfile.d/services/mw-jobrunner: apply
[10:43:52] <wikibugs>	 (CR) Marostegui: [C: +1] orchestrator: Reject non-tls requests with a 403 [puppet] - https://gerrit.wikimedia.org/r/859449 (https://phabricator.wikimedia.org/T238720) (owner: Vgutierrez)
[10:43:52] <logmsgbot>	 !log jnuche@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply
[10:44:01] <logmsgbot>	 !log jnuche@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-jobrunner: apply
[10:44:02] <logmsgbot>	 !log jnuche@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply
[10:44:21] <logmsgbot>	 !log jnuche@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply
[10:44:27] <logmsgbot>	 !log jnuche@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply
[10:44:30] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121 (T322618)', diff saved to https://phabricator.wikimedia.org/P40516 and previous config saved to /var/cache/conftool/dbconfig/20221122-104429-ladsgroup.json
[10:44:32] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2122.codfw.wmnet with reason: Maintenance
[10:44:32] <logmsgbot>	 !log jnuche@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply
[10:44:35] <logmsgbot>	 !log jnuche@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply
[10:44:36] <stashbot>	 T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618
[10:44:42] <logmsgbot>	 !log jnuche@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply
[10:44:42] <wikibugs>	 (PS1) Slyngshede: Configuration: Add support for setting connection timeout. [software/bitu-ldap] - https://gerrit.wikimedia.org/r/859453
[10:44:43] <logmsgbot>	 !log jnuche@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-web: apply
[10:44:43] <logmsgbot>	 !log jnuche@deploy1002 helmfile [codfw] START helmfile.d/services/mw-web: apply
[10:44:43] <logmsgbot>	 !log jnuche@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply
[10:44:43] <logmsgbot>	 !log jnuche@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply
[10:44:43] <logmsgbot>	 !log jnuche@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply
[10:44:43] <logmsgbot>	 !log jnuche@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply
[10:44:43] <logmsgbot>	 !log jnuche@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply
[10:44:44] <logmsgbot>	 !log jnuche@deploy1002 helmfile [codfw] START helmfile.d/services/mw-jobrunner: apply
[10:44:45] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2122.codfw.wmnet with reason: Maintenance
[10:44:52] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2122 (T322618)', diff saved to https://phabricator.wikimedia.org/P40517 and previous config saved to /var/cache/conftool/dbconfig/20221122-104451-ladsgroup.json
[10:44:56] <claime>	 Sorry for the flood.
[10:45:13] <logmsgbot>	 !log jnuche@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply
[10:45:14] <jinxer-wm>	 (KubernetesAPILatency) firing: (12) High Kubernetes API latency (LIST authorizationpolicies) on k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:45:34] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T321312)', diff saved to https://phabricator.wikimedia.org/P40518 and previous config saved to /var/cache/conftool/dbconfig/20221122-104534-ladsgroup.json
[10:45:36] <logmsgbot>	 !log jnuche@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-jobrunner: apply
[10:45:38] <logmsgbot>	 !log jnuche@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply
[10:45:38] <logmsgbot>	 !log jnuche@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply
[10:45:39] <wikibugs>	 (CR) Jcrespo: [C: +1] orchestrator: Reject non-tls requests with a 403 [puppet] - https://gerrit.wikimedia.org/r/859449 (https://phabricator.wikimedia.org/T238720) (owner: Vgutierrez)
[10:45:44] <logmsgbot>	 !log jnuche@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply
[10:45:44] <logmsgbot>	 !log jnuche@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply
[10:46:02] <logmsgbot>	 !log jnuche@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply
[10:46:03] <wikibugs>	 (CR) Ladsgroup: [C: +1] orchestrator: Reject non-tls requests with a 403 [puppet] - https://gerrit.wikimedia.org/r/859449 (https://phabricator.wikimedia.org/T238720) (owner: Vgutierrez)
[10:46:07] <logmsgbot>	 !log jnuche@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply
[10:47:04] <wikibugs>	 SRE, Traffic, Patch-For-Review: Deprecate and disable port 80 for one-off sites under canonical domains - https://phabricator.wikimedia.org/T238720 (Vgutierrez)
[10:47:08] <wikibugs>	 (CR) Vgutierrez: [V: +1 C: +2] orchestrator: Reject non-tls requests with a 403 [puppet] - https://gerrit.wikimedia.org/r/859449 (https://phabricator.wikimedia.org/T238720) (owner: Vgutierrez)
[10:47:08] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122 (T322618)', diff saved to https://phabricator.wikimedia.org/P40519 and previous config saved to /var/cache/conftool/dbconfig/20221122-104708-ladsgroup.json
[10:49:16] <logmsgbot>	 !log jnuche@deploy1002 Finished scap: testing k8s deploys (duration: 21m 06s)
[10:50:22] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1107 (T321130)', diff saved to https://phabricator.wikimedia.org/P40520 and previous config saved to /var/cache/conftool/dbconfig/20221122-105021-marostegui.json
[10:50:27] <stashbot>	 T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130
[10:50:30] <wikibugs>	 (PS1) Jcrespo: Add man page for tranfer.py executable [software/transferpy] - https://gerrit.wikimedia.org/r/859455
[10:50:49] <wikibugs>	 (PS2) Jcrespo: Add man page for transfer.py executable [software/transferpy] - https://gerrit.wikimedia.org/r/859455
[10:51:19] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136', diff saved to https://phabricator.wikimedia.org/P40521 and previous config saved to /var/cache/conftool/dbconfig/20221122-105118-ladsgroup.json
[10:51:23] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1029.eqiad.wmnet
[10:51:25] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P40522 and previous config saved to /var/cache/conftool/dbconfig/20221122-105124-marostegui.json
[10:55:51] <wikibugs>	 (PS1) Vgutierrez: icinga: Reject non-tls requests with a 403 [puppet] - https://gerrit.wikimedia.org/r/859457 (https://phabricator.wikimedia.org/T238720)
[10:58:28] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1029.eqiad.wmnet
[10:58:59] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1090 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[10:59:49] <wikibugs>	 (CR) Vgutierrez: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38376/console" [puppet] - https://gerrit.wikimedia.org/r/859457 (https://phabricator.wikimedia.org/T238720) (owner: Vgutierrez)
[11:00:41] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P40523 and previous config saved to /var/cache/conftool/dbconfig/20221122-110040-ladsgroup.json
[11:00:58] <wikibugs>	 (CR) Cathal Mooney: [C: +1] "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/859435 (https://phabricator.wikimedia.org/T319184) (owner: Arturo Borrero Gonzalez)
[11:01:05] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +2] cloudvirt1049: move to modern NIC setup [puppet] - https://gerrit.wikimedia.org/r/859435 (https://phabricator.wikimedia.org/T319184) (owner: Arturo Borrero Gonzalez)
[11:01:39] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1049.eqiad.wmnet with OS bullseye
[11:01:49] <wikibugs>	 SRE, Infrastructure-Foundations, netops, Patch-For-Review, cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin1001 for host cloudvirt1049.eqiad.wmnet with O...
[11:02:15] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122', diff saved to https://phabricator.wikimedia.org/P40524 and previous config saved to /var/cache/conftool/dbconfig/20221122-110214-ladsgroup.json
[11:03:50] <wikibugs>	 SRE-swift-storage, Infrastructure-Foundations, Patch-For-Review: unstable device mapping of SSDs causing installer problems - example reimage with destruction of swift filesystem - https://phabricator.wikimedia.org/T308677 (MatthewVernon) Bother :-/
[11:04:34] <wikibugs>	 (CR) DCausse: [WIP] flink and flink-kubernetes-operator image (1 comment) [docker-images/production-images] - https://gerrit.wikimedia.org/r/858356 (https://phabricator.wikimedia.org/T316519) (owner: Ottomata)
[11:05:28] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1107', diff saved to https://phabricator.wikimedia.org/P40525 and previous config saved to /var/cache/conftool/dbconfig/20221122-110528-marostegui.json
[11:05:29] <logmsgbot>	 !log stevemunene@deploy1002 Started deploy [analytics/turnilo/deploy@51da050]: (no justification provided)
[11:06:25] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136', diff saved to https://phabricator.wikimedia.org/P40526 and previous config saved to /var/cache/conftool/dbconfig/20221122-110625-ladsgroup.json
[11:06:32] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P40527 and previous config saved to /var/cache/conftool/dbconfig/20221122-110631-marostegui.json
[11:07:41] <logmsgbot>	 !log stevemunene@deploy1002 Finished deploy [analytics/turnilo/deploy@51da050]: (no justification provided) (duration: 02m 12s)
[11:07:55] <wikibugs>	 (PS2) Arturo Borrero Gonzalez: cookbooks: wmcs: cloudvirt: add cookbook to maintain canary VMs [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/859114
[11:08:17] <wikibugs>	 SRE, Infrastructure-Foundations, netops, Patch-For-Review, cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (aborrero)
[11:08:30] <wikibugs>	 (CR) Hnowlan: [C: +2] thumbor: fix metrics prefix [deployment-charts] - https://gerrit.wikimedia.org/r/859106 (https://phabricator.wikimedia.org/T233196) (owner: Hnowlan)
[11:08:41] <wikibugs>	 (CR) Hnowlan: thumbor: fix metrics prefix [deployment-charts] - https://gerrit.wikimedia.org/r/859106 (https://phabricator.wikimedia.org/T233196) (owner: Hnowlan)
[11:08:46] <wikibugs>	 (CR) Jbond: profile::kafka::broker: add pki_intermediate_name parameter (1 comment) [puppet] - https://gerrit.wikimedia.org/r/757800 (https://phabricator.wikimedia.org/T291905) (owner: Elukey)
[11:09:10] <wikibugs>	 (PS2) Hnowlan: thumbor: fix metrics prefix [deployment-charts] - https://gerrit.wikimedia.org/r/859106 (https://phabricator.wikimedia.org/T233196)
[11:09:57] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1090 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[11:10:43] <moritzm>	 !log installing gnutls28 security updates
[11:10:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:11:02] <wikibugs>	 (CR) Jbond: [C: +1] "ack makes sense" [puppet] - https://gerrit.wikimedia.org/r/859112 (https://phabricator.wikimedia.org/T319163) (owner: Filippo Giunchedi)
[11:11:42] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] icinga: Reject non-tls requests with a 403 [puppet] - https://gerrit.wikimedia.org/r/859457 (https://phabricator.wikimedia.org/T238720) (owner: Vgutierrez)
[11:12:44] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] pontoon: write out puppet/pki CA certs [puppet] - https://gerrit.wikimedia.org/r/859112 (https://phabricator.wikimedia.org/T319163) (owner: Filippo Giunchedi)
[11:13:33] <icinga-wm>	 PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:13:40] <wikibugs>	 (CR) CI reject: [V: -1] thumbor: fix metrics prefix [deployment-charts] - https://gerrit.wikimedia.org/r/859106 (https://phabricator.wikimedia.org/T233196) (owner: Hnowlan)
[11:14:14] <wikibugs>	 (PS1) Ilias Sarantopoulos: ml-services: Update docker images to use single model server [deployment-charts] - https://gerrit.wikimedia.org/r/859461 (https://phabricator.wikimedia.org/T320374)
[11:15:35] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1049.eqiad.wmnet with reason: host reimage
[11:15:47] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P40528 and previous config saved to /var/cache/conftool/dbconfig/20221122-111547-ladsgroup.json
[11:15:48] <wikibugs>	 (PS3) Hnowlan: thumbor: fix metrics prefix [deployment-charts] - https://gerrit.wikimedia.org/r/859106 (https://phabricator.wikimedia.org/T233196)
[11:16:12] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1030.eqiad.wmnet
[11:17:21] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122', diff saved to https://phabricator.wikimedia.org/P40529 and previous config saved to /var/cache/conftool/dbconfig/20221122-111721-ladsgroup.json
[11:18:11] <wikibugs>	 SRE, Traffic, Patch-For-Review: Deprecate and disable port 80 for one-off sites under canonical domains - https://phabricator.wikimedia.org/T238720 (Vgutierrez)
[11:18:31] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1049.eqiad.wmnet with reason: host reimage
[11:20:32] <wikibugs>	 (PS1) Giuseppe Lavagetto: mediawiki-image-download: don't exit 1 if no images to remove [puppet] - https://gerrit.wikimedia.org/r/859465
[11:20:34] <wikibugs>	 (PS1) Giuseppe Lavagetto: scap::dsh: support querying puppetdb, use for k8s-workers [puppet] - https://gerrit.wikimedia.org/r/859466
[11:20:35] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1107', diff saved to https://phabricator.wikimedia.org/P40530 and previous config saved to /var/cache/conftool/dbconfig/20221122-112034-marostegui.json
[11:20:56] <wikibugs>	 SRE, Traffic, Patch-For-Review: Deprecate and disable port 80 for one-off sites under canonical domains - https://phabricator.wikimedia.org/T238720 (Vgutierrez)
[11:21:32] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136 (T322618)', diff saved to https://phabricator.wikimedia.org/P40531 and previous config saved to /var/cache/conftool/dbconfig/20221122-112131-ladsgroup.json
[11:21:37] <stashbot>	 T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618
[11:21:38] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T321126)', diff saved to https://phabricator.wikimedia.org/P40532 and previous config saved to /var/cache/conftool/dbconfig/20221122-112137-marostegui.json
[11:21:39] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[11:21:42] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[11:21:43] <stashbot>	 T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126
[11:22:12] <wikibugs>	 (CR) Giuseppe Lavagetto: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38377/console" [puppet] - https://gerrit.wikimedia.org/r/859466 (owner: Giuseppe Lavagetto)
[11:22:28] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1030.eqiad.wmnet
[11:23:32] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] mediawiki-image-download: don't exit 1 if no images to remove [puppet] - https://gerrit.wikimedia.org/r/859465 (owner: Giuseppe Lavagetto)
[11:26:03] <wikibugs>	 (PS1) Vgutierrez: dumps: Reject non-tls requests with a 403 [puppet] - https://gerrit.wikimedia.org/r/859467 (https://phabricator.wikimedia.org/T238720)
[11:28:01] <wikibugs>	 (CR) Vgutierrez: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38378/console" [puppet] - https://gerrit.wikimedia.org/r/859467 (https://phabricator.wikimedia.org/T238720) (owner: Vgutierrez)
[11:28:07] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1158.eqiad.wmnet with reason: Maintenance
[11:28:20] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1158.eqiad.wmnet with reason: Maintenance
[11:28:22] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[11:28:37] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[11:28:41] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1161.eqiad.wmnet with reason: Maintenance
[11:28:43] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1158 (T322618)', diff saved to https://phabricator.wikimedia.org/P40533 and previous config saved to /var/cache/conftool/dbconfig/20221122-112843-ladsgroup.json
[11:28:44] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1161.eqiad.wmnet with reason: Maintenance
[11:28:46] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[11:28:50] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[11:28:50] <stashbot>	 T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618
[11:28:56] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1161 (T321126)', diff saved to https://phabricator.wikimedia.org/P40534 and previous config saved to /var/cache/conftool/dbconfig/20221122-112856-marostegui.json
[11:29:03] <wikibugs>	 (PS1) Jbond: convrt-sssd: simplify logic [cookbooks] - https://gerrit.wikimedia.org/r/859470
[11:29:06] <stashbot>	 T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126
[11:30:53] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T322618)', diff saved to https://phabricator.wikimedia.org/P40535 and previous config saved to /var/cache/conftool/dbconfig/20221122-113053-ladsgroup.json
[11:30:53] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T321312)', diff saved to https://phabricator.wikimedia.org/P40536 and previous config saved to /var/cache/conftool/dbconfig/20221122-113053-ladsgroup.json
[11:31:28] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T321126)', diff saved to https://phabricator.wikimedia.org/P40537 and previous config saved to /var/cache/conftool/dbconfig/20221122-113127-marostegui.json
[11:32:28] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122 (T322618)', diff saved to https://phabricator.wikimedia.org/P40538 and previous config saved to /var/cache/conftool/dbconfig/20221122-113227-ladsgroup.json
[11:32:29] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2150.codfw.wmnet with reason: Maintenance
[11:32:43] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2150.codfw.wmnet with reason: Maintenance
[11:32:49] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2150 (T322618)', diff saved to https://phabricator.wikimedia.org/P40539 and previous config saved to /var/cache/conftool/dbconfig/20221122-113249-ladsgroup.json
[11:33:49] <wikibugs>	 (CR) CI reject: [V: -1] convrt-sssd: simplify logic [cookbooks] - https://gerrit.wikimedia.org/r/859470 (owner: Jbond)
[11:35:06] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T322618)', diff saved to https://phabricator.wikimedia.org/P40540 and previous config saved to /var/cache/conftool/dbconfig/20221122-113506-ladsgroup.json
[11:35:12] <stashbot>	 T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618
[11:35:17] <wikibugs>	 (CR) AikoChou: ml-services: Update docker images to use single model server (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/859461 (https://phabricator.wikimedia.org/T320374) (owner: Ilias Sarantopoulos)
[11:35:41] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1107 (T321130)', diff saved to https://phabricator.wikimedia.org/P40541 and previous config saved to /var/cache/conftool/dbconfig/20221122-113541-marostegui.json
[11:35:43] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1118.eqiad.wmnet with reason: Maintenance
[11:35:47] <stashbot>	 T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130
[11:35:56] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1118.eqiad.wmnet with reason: Maintenance
[11:36:03] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1118 (T321130)', diff saved to https://phabricator.wikimedia.org/P40542 and previous config saved to /var/cache/conftool/dbconfig/20221122-113602-marostegui.json
[11:40:11] <wikibugs>	 (PS2) Giuseppe Lavagetto: scap::dsh: support querying puppetdb, use for k8s-workers [puppet] - https://gerrit.wikimedia.org/r/859466
[11:40:38] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1090 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[11:44:56] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1049.eqiad.wmnet with OS bullseye
[11:44:59] <jinxer-wm>	 (KubernetesAPILatency) firing: (13) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:45:07] <wikibugs>	 SRE, Infrastructure-Foundations, netops, Patch-For-Review, cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin1001 for host cloudvirt1049.eqiad.wmnet with OS bu...
[11:45:18] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] scap::dsh: support querying puppetdb, use for k8s-workers [puppet] - https://gerrit.wikimedia.org/r/859466 (owner: Giuseppe Lavagetto)
[11:46:00] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P40543 and previous config saved to /var/cache/conftool/dbconfig/20221122-114559-ladsgroup.json
[11:46:34] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P40544 and previous config saved to /var/cache/conftool/dbconfig/20221122-114634-marostegui.json
[11:49:25] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118 (T321130)', diff saved to https://phabricator.wikimedia.org/P40545 and previous config saved to /var/cache/conftool/dbconfig/20221122-114925-marostegui.json
[11:49:31] <stashbot>	 T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130
[11:49:59] <jinxer-wm>	 (KubernetesAPILatency) firing: (13) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:50:13] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P40546 and previous config saved to /var/cache/conftool/dbconfig/20221122-115012-ladsgroup.json
[11:52:59] <icinga-wm>	 ACKNOWLEDGEMENT - MegaRAID on an-worker1090 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough Btullis T318659 - Added more downtime, but replacement batteries are on their way https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[11:53:09] <effie>	 !log MAPS maintenance EQIAD: trigger full planet re-import for maps eqiad
[11:53:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:54:34] <wikibugs>	 SRE, Traffic, serviceops, Patch-For-Review: _etcd-client SRV record missing for conftool cluster - https://phabricator.wikimedia.org/T320397 (Vgutierrez) Open→Resolved a:Joe ` vgutierrez@lvs6001:~$ ./liberica etcd --config /home/vgutierrez/config.yaml  Using config file: /home/vgutier...
[11:56:16] <wikibugs>	 (PS1) Stevemunene: Allow introspection for staging environment. [puppet] - https://gerrit.wikimedia.org/r/859479 (https://phabricator.wikimedia.org/T308778)
[11:56:44] <wikibugs>	 (CR) Hnowlan: [C: +1] api-gateway: convert to modules [deployment-charts] - https://gerrit.wikimedia.org/r/856950 (owner: Giuseppe Lavagetto)
[11:56:55] <effie>	  !log MAPS maintenance EQIAD: trigger full planet re-import for maps eqiad - T314472
[11:56:56] <stashbot>	 T314472: Re-import full planet data into eqiad - https://phabricator.wikimedia.org/T314472
[11:58:25] <wikibugs>	 (CR) ArielGlenn: [C: +1] "Great idea!" [puppet] - https://gerrit.wikimedia.org/r/859467 (https://phabricator.wikimedia.org/T238720) (owner: Vgutierrez)
[11:58:39] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on aux-k8s-etcd1001.eqiad.wmnet with reason: ganeti reboot
[11:58:47] <wikibugs>	 (CR) Vgutierrez: [V: +1 C: +2] dumps: Reject non-tls requests with a 403 [puppet] - https://gerrit.wikimedia.org/r/859467 (https://phabricator.wikimedia.org/T238720) (owner: Vgutierrez)
[11:58:55] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on aux-k8s-etcd1001.eqiad.wmnet with reason: ganeti reboot
[11:59:09] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on kubetcd1005.eqiad.wmnet with reason: ganeti reboot
[11:59:25] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on kubetcd1005.eqiad.wmnet with reason: ganeti reboot
[11:59:58] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1031.eqiad.wmnet
[12:01:06] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P40547 and previous config saved to /var/cache/conftool/dbconfig/20221122-120106-ladsgroup.json
[12:01:41] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P40548 and previous config saved to /var/cache/conftool/dbconfig/20221122-120140-marostegui.json
[12:02:53] <wikibugs>	 (CR) Reedy: [C: +1] build: Update to PHPUnit 9.5 and declare php requirement [mediawiki-config] - https://gerrit.wikimedia.org/r/858441 (https://phabricator.wikimedia.org/T235142) (owner: Krinkle)
[12:03:02] <wikibugs>	 (CR) Hnowlan: [C: +2] thumbor: fix metrics prefix [deployment-charts] - https://gerrit.wikimedia.org/r/859106 (https://phabricator.wikimedia.org/T233196) (owner: Hnowlan)
[12:04:32] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118', diff saved to https://phabricator.wikimedia.org/P40549 and previous config saved to /var/cache/conftool/dbconfig/20221122-120431-marostegui.json
[12:04:41] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1031.eqiad.wmnet
[12:04:53] <wikibugs>	 SRE, Traffic, Patch-For-Review: Deprecate and disable port 80 for one-off sites under canonical domains - https://phabricator.wikimedia.org/T238720 (Vgutierrez)
[12:04:59] <jinxer-wm>	 (KubernetesAPILatency) firing: (13) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[12:05:19] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P40550 and previous config saved to /var/cache/conftool/dbconfig/20221122-120519-ladsgroup.json
[12:08:08] <wikibugs>	 (Merged) jenkins-bot: thumbor: fix metrics prefix [deployment-charts] - https://gerrit.wikimedia.org/r/859106 (https://phabricator.wikimedia.org/T233196) (owner: Hnowlan)
[12:10:34] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: sync
[12:10:37] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: sync
[12:11:49] <jinxer-wm>	 (RdfStreamingUpdaterFlinkJobUnstable) firing: (2) WCQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[12:14:06] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: sync
[12:14:22] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: sync
[12:15:43] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: cloudvirt1048: move to modern NIC setup [puppet] - https://gerrit.wikimedia.org/r/859481 (https://phabricator.wikimedia.org/T319184)
[12:16:13] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T322618)', diff saved to https://phabricator.wikimedia.org/P40551 and previous config saved to /var/cache/conftool/dbconfig/20221122-121612-ladsgroup.json
[12:16:14] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance
[12:16:18] <stashbot>	 T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618
[12:16:28] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance
[12:16:34] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3317 (T322618)', diff saved to https://phabricator.wikimedia.org/P40552 and previous config saved to /var/cache/conftool/dbconfig/20221122-121633-ladsgroup.json
[12:16:47] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T321126)', diff saved to https://phabricator.wikimedia.org/P40553 and previous config saved to /var/cache/conftool/dbconfig/20221122-121647-marostegui.json
[12:16:49] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1200.eqiad.wmnet with reason: Maintenance
[12:16:49] <jinxer-wm>	 (RdfStreamingUpdaterFlinkJobUnstable) resolved: (2) WCQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[12:16:51] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1200.eqiad.wmnet with reason: Maintenance
[12:16:52] <stashbot>	 T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126
[12:16:58] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1200 (T321126)', diff saved to https://phabricator.wikimedia.org/P40554 and previous config saved to /var/cache/conftool/dbconfig/20221122-121657-marostegui.json
[12:18:44] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T322618)', diff saved to https://phabricator.wikimedia.org/P40555 and previous config saved to /var/cache/conftool/dbconfig/20221122-121843-ladsgroup.json
[12:18:49] <wikibugs>	 (PS1) Hnowlan: thumbor: correct haproxy metrics URI [deployment-charts] - https://gerrit.wikimedia.org/r/859482
[12:19:29] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T321126)', diff saved to https://phabricator.wikimedia.org/P40556 and previous config saved to /var/cache/conftool/dbconfig/20221122-121928-marostegui.json
[12:19:38] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118', diff saved to https://phabricator.wikimedia.org/P40557 and previous config saved to /var/cache/conftool/dbconfig/20221122-121938-marostegui.json
[12:20:26] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T322618)', diff saved to https://phabricator.wikimedia.org/P40558 and previous config saved to /var/cache/conftool/dbconfig/20221122-122025-ladsgroup.json
[12:20:28] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2159.codfw.wmnet with reason: Maintenance
[12:20:42] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2159.codfw.wmnet with reason: Maintenance
[12:20:43] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2095.codfw.wmnet with reason: Maintenance
[12:20:56] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2095.codfw.wmnet with reason: Maintenance
[12:21:04] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2159 (T322618)', diff saved to https://phabricator.wikimedia.org/P40559 and previous config saved to /var/cache/conftool/dbconfig/20221122-122103-ladsgroup.json
[12:22:33] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1090 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[12:23:01] <icinga-wm>	 RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:23:09] <wikibugs>	 (CR) Cathal Mooney: [C: +1] cloudvirt1048: move to modern NIC setup [puppet] - https://gerrit.wikimedia.org/r/859481 (https://phabricator.wikimedia.org/T319184) (owner: Arturo Borrero Gonzalez)
[12:23:21] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T322618)', diff saved to https://phabricator.wikimedia.org/P40560 and previous config saved to /var/cache/conftool/dbconfig/20221122-122320-ladsgroup.json
[12:23:26] <stashbot>	 T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618
[12:25:06] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +2] cloudvirt1048: move to modern NIC setup [puppet] - https://gerrit.wikimedia.org/r/859481 (https://phabricator.wikimedia.org/T319184) (owner: Arturo Borrero Gonzalez)
[12:25:34] <wikibugs>	 (PS5) Urbanecm: GrowthExperiments: Remove unused config variable GEMentorDashboardUseVue [mediawiki-config] - https://gerrit.wikimedia.org/r/856008 (owner: Sergio Gimeno)
[12:25:43] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1048.eqiad.wmnet with OS bullseye
[12:25:46] <wikibugs>	 (PS1) Giuseppe Lavagetto: image-suggestion: convert to modules [deployment-charts] - https://gerrit.wikimedia.org/r/859485
[12:25:48] <wikibugs>	 (PS1) Giuseppe Lavagetto: proton: convert to modules [deployment-charts] - https://gerrit.wikimedia.org/r/859486
[12:25:50] <wikibugs>	 (PS1) Giuseppe Lavagetto: citoid: convert to modules [deployment-charts] - https://gerrit.wikimedia.org/r/859487
[12:25:52] <wikibugs>	 (PS1) Giuseppe Lavagetto: cxserver: convert to modules [deployment-charts] - https://gerrit.wikimedia.org/r/859488
[12:25:53] <wikibugs>	 SRE, Infrastructure-Foundations, netops, Patch-For-Review, cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin1001 for host cloudvirt1048.eqiad.wmnet with O...
[12:26:01] <wikibugs>	 (PS1) Giuseppe Lavagetto: datahub: convert to modules [deployment-charts] - https://gerrit.wikimedia.org/r/859489
[12:26:05] <wikibugs>	 (PS1) Giuseppe Lavagetto: developer-portal: convert to modules [deployment-charts] - https://gerrit.wikimedia.org/r/859490
[12:27:06] <wikibugs>	 (CR) CI reject: [V: -1] image-suggestion: convert to modules [deployment-charts] - https://gerrit.wikimedia.org/r/859485 (owner: Giuseppe Lavagetto)
[12:27:08] <wikibugs>	 (CR) CI reject: [V: -1] proton: convert to modules [deployment-charts] - https://gerrit.wikimedia.org/r/859486 (owner: Giuseppe Lavagetto)
[12:27:13] <wikibugs>	 (CR) CI reject: [V: -1] citoid: convert to modules [deployment-charts] - https://gerrit.wikimedia.org/r/859487 (owner: Giuseppe Lavagetto)
[12:27:24] <wikibugs>	 (PS2) Jbond: convrt-sssd: simplify logic [cookbooks] - https://gerrit.wikimedia.org/r/859470
[12:27:33] <wikibugs>	 (CR) CI reject: [V: -1] cxserver: convert to modules [deployment-charts] - https://gerrit.wikimedia.org/r/859488 (owner: Giuseppe Lavagetto)
[12:27:41] <wikibugs>	 (CR) CI reject: [V: -1] datahub: convert to modules [deployment-charts] - https://gerrit.wikimedia.org/r/859489 (owner: Giuseppe Lavagetto)
[12:27:42] <wikibugs>	 (CR) CI reject: [V: -1] developer-portal: convert to modules [deployment-charts] - https://gerrit.wikimedia.org/r/859490 (owner: Giuseppe Lavagetto)
[12:28:43] <wikibugs>	 (PS3) Jbond: convrt-sssd: simplify logic [cookbooks] - https://gerrit.wikimedia.org/r/859470
[12:29:34] <jnuche>	 jouncebot: nowandnext
[12:29:34] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 30 minute(s)
[12:29:34] <jouncebot>	 In 1 hour(s) and 30 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221122T1400)
[12:29:34] <jouncebot>	 In 1 hour(s) and 30 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221122T1400)
[12:29:59] <jinxer-wm>	 (KubernetesAPILatency) firing: (13) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[12:30:31] <wikibugs>	 (PS1) Cathal Mooney: New release incorporating changes to the wmf-netbox plugin [software/homer/deploy] - https://gerrit.wikimedia.org/r/859491 (https://phabricator.wikimedia.org/T312635)
[12:31:36] <wikibugs>	 (CR) Btullis: [C: +1] "Looks good to me, thanks." [puppet] - https://gerrit.wikimedia.org/r/859479 (https://phabricator.wikimedia.org/T308778) (owner: Stevemunene)
[12:33:50] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P40561 and previous config saved to /var/cache/conftool/dbconfig/20221122-123350-ladsgroup.json
[12:33:51] <wikibugs>	 (CR) CI reject: [V: -1] convrt-sssd: simplify logic [cookbooks] - https://gerrit.wikimedia.org/r/859470 (owner: Jbond)
[12:34:35] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P40562 and previous config saved to /var/cache/conftool/dbconfig/20221122-123435-marostegui.json
[12:34:44] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118 (T321130)', diff saved to https://phabricator.wikimedia.org/P40563 and previous config saved to /var/cache/conftool/dbconfig/20221122-123444-marostegui.json
[12:34:46] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1119.eqiad.wmnet with reason: Maintenance
[12:34:50] <stashbot>	 T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130
[12:35:00] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1119.eqiad.wmnet with reason: Maintenance
[12:35:06] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1119 (T321130)', diff saved to https://phabricator.wikimedia.org/P40564 and previous config saved to /var/cache/conftool/dbconfig/20221122-123505-marostegui.json
[12:36:49] <logmsgbot>	 !log jnuche@deploy1002 Installing scap version "4.29.1" for 559 hosts
[12:37:20] <logmsgbot>	 !log jnuche@deploy1002 Installation of scap version "4.29.1" completed for 559 hosts
[12:38:27] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P40565 and previous config saved to /var/cache/conftool/dbconfig/20221122-123827-ladsgroup.json
[12:39:59] <jinxer-wm>	 (KubernetesAPILatency) firing: (13) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[12:40:08] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1048.eqiad.wmnet with reason: host reimage
[12:40:42] <icinga-wm>	 RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:42:47] <logmsgbot>	 !log jnuche@deploy1002 Started scap: testing k8s deploys
[12:43:42] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1048.eqiad.wmnet with reason: host reimage
[12:44:59] <jinxer-wm>	 (KubernetesAPILatency) firing: (13) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[12:45:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:48:19] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119 (T321130)', diff saved to https://phabricator.wikimedia.org/P40567 and previous config saved to /var/cache/conftool/dbconfig/20221122-124818-marostegui.json
[12:48:25] <stashbot>	 T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130
[12:48:57] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P40568 and previous config saved to /var/cache/conftool/dbconfig/20221122-124856-ladsgroup.json
[12:49:07] <logmsgbot>	 !log jnuche@deploy1002 Finished scap: testing k8s deploys (duration: 06m 20s)
[12:49:42] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P40569 and previous config saved to /var/cache/conftool/dbconfig/20221122-124941-marostegui.json
[12:50:03] <jinxer-wm>	 (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:53:34] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P40570 and previous config saved to /var/cache/conftool/dbconfig/20221122-125333-ladsgroup.json
[12:57:24] <wikibugs>	 SRE, Thumbor, Thumbor Migration, serviceops, and 2 others: Encoding issues when handling unicode characters in filenames - https://phabricator.wikimedia.org/T323114 (hnowlan) Open→Resolved
[12:57:26] <wikibugs>	 SRE, Thumbor, Thumbor Migration, serviceops, and 2 others: Migrate thumbor to Kubernetes - https://phabricator.wikimedia.org/T233196 (hnowlan)
[13:00:01] <wikibugs>	 (CR) Hnowlan: [C: +2] thumbor: correct haproxy metrics URI [deployment-charts] - https://gerrit.wikimedia.org/r/859482 (owner: Hnowlan)
[13:01:51] <icinga-wm>	 PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:03:25] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119', diff saved to https://phabricator.wikimedia.org/P40571 and previous config saved to /var/cache/conftool/dbconfig/20221122-130325-marostegui.json
[13:04:03] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T322618)', diff saved to https://phabricator.wikimedia.org/P40572 and previous config saved to /var/cache/conftool/dbconfig/20221122-130403-ladsgroup.json
[13:04:05] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[13:04:10] <stashbot>	 T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618
[13:04:18] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[13:04:23] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1174.eqiad.wmnet with reason: Maintenance
[13:04:36] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1174.eqiad.wmnet with reason: Maintenance
[13:04:42] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1174 (T322618)', diff saved to https://phabricator.wikimedia.org/P40573 and previous config saved to /var/cache/conftool/dbconfig/20221122-130442-ladsgroup.json
[13:04:48] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T321126)', diff saved to https://phabricator.wikimedia.org/P40574 and previous config saved to /var/cache/conftool/dbconfig/20221122-130447-marostegui.json
[13:04:49] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[13:04:52] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[13:04:53] <stashbot>	 T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126
[13:05:18] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2101.codfw.wmnet with reason: Maintenance
[13:05:32] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2101.codfw.wmnet with reason: Maintenance
[13:06:52] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T322618)', diff saved to https://phabricator.wikimedia.org/P40575 and previous config saved to /var/cache/conftool/dbconfig/20221122-130652-ladsgroup.json
[13:06:53] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2111.codfw.wmnet with reason: Maintenance
[13:06:55] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2111.codfw.wmnet with reason: Maintenance
[13:07:02] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2111 (T321126)', diff saved to https://phabricator.wikimedia.org/P40576 and previous config saved to /var/cache/conftool/dbconfig/20221122-130701-marostegui.json
[13:07:32] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1048.eqiad.wmnet with OS bullseye
[13:07:37] <wikibugs>	 (Merged) jenkins-bot: thumbor: correct haproxy metrics URI [deployment-charts] - https://gerrit.wikimedia.org/r/859482 (owner: Hnowlan)
[13:07:39] <icinga-wm>	 PROBLEM - SSH on db1122.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:07:42] <wikibugs>	 SRE, Infrastructure-Foundations, netops, Patch-For-Review, cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin1001 for host cloudvirt1048.eqiad.wmnet with OS bu...
[13:08:40] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[13:08:40] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T322618)', diff saved to https://phabricator.wikimedia.org/P40577 and previous config saved to /var/cache/conftool/dbconfig/20221122-130840-ladsgroup.json
[13:08:42] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2168.codfw.wmnet with reason: Maintenance
[13:08:55] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2168.codfw.wmnet with reason: Maintenance
[13:09:02] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2168:3317 (T322618)', diff saved to https://phabricator.wikimedia.org/P40578 and previous config saved to /var/cache/conftool/dbconfig/20221122-130901-ladsgroup.json
[13:09:32] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: sync
[13:09:50] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: sync
[13:10:26] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111 (T321126)', diff saved to https://phabricator.wikimedia.org/P40579 and previous config saved to /var/cache/conftool/dbconfig/20221122-131025-marostegui.json
[13:10:31] <stashbot>	 T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126
[13:11:19] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317 (T322618)', diff saved to https://phabricator.wikimedia.org/P40580 and previous config saved to /var/cache/conftool/dbconfig/20221122-131118-ladsgroup.json
[13:11:24] <stashbot>	 T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618
[13:14:59] <jinxer-wm>	 (KubernetesAPILatency) firing: (13) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:18:33] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119', diff saved to https://phabricator.wikimedia.org/P40581 and previous config saved to /var/cache/conftool/dbconfig/20221122-131831-marostegui.json
[13:21:59] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P40582 and previous config saved to /var/cache/conftool/dbconfig/20221122-132158-ladsgroup.json
[13:25:32] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111', diff saved to https://phabricator.wikimedia.org/P40583 and previous config saved to /var/cache/conftool/dbconfig/20221122-132532-marostegui.json
[13:26:25] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317', diff saved to https://phabricator.wikimedia.org/P40584 and previous config saved to /var/cache/conftool/dbconfig/20221122-132625-ladsgroup.json
[13:26:40] <wikibugs>	 (CR) Stevemunene: [C: +2] Allow introspection for staging environment. [puppet] - https://gerrit.wikimedia.org/r/859479 (https://phabricator.wikimedia.org/T308778) (owner: Stevemunene)
[13:28:03] <wikibugs>	 (PS1) David Caro: dumps: fix http alert to check the new status [puppet] - https://gerrit.wikimedia.org/r/859498 (https://phabricator.wikimedia.org/T238720)
[13:28:13] <wikibugs>	 SRE, Infrastructure-Foundations, netops, Patch-For-Review, cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (aborrero)
[13:28:40] <wikibugs>	 (CR) CI reject: [V: -1] dumps: fix http alert to check the new status [puppet] - https://gerrit.wikimedia.org/r/859498 (https://phabricator.wikimedia.org/T238720) (owner: David Caro)
[13:29:19] <icinga-wm>	 PROBLEM - Host ganeti1012 is DOWN: PING CRITICAL - Packet loss = 100%
[13:30:03] <wikibugs>	 (PS3) Arturo Borrero Gonzalez: cookbooks: wmcs: cloudvirt: add cookbook to maintain canary VMs [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/859114
[13:31:37] <icinga-wm>	 PROBLEM - Host ganeti1012.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[13:32:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:32:48] <wikibugs>	 (CR) Vgutierrez: [C: +1] dumps: fix http alert to check the new status [puppet] - https://gerrit.wikimedia.org/r/859498 (https://phabricator.wikimedia.org/T238720) (owner: David Caro)
[13:33:39] <wikibugs>	 (PS2) David Caro: dumps: fix http alert to check the new status [puppet] - https://gerrit.wikimedia.org/r/859498 (https://phabricator.wikimedia.org/T238720)
[13:33:40] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119 (T321130)', diff saved to https://phabricator.wikimedia.org/P40585 and previous config saved to /var/cache/conftool/dbconfig/20221122-133339-marostegui.json
[13:33:41] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1128.eqiad.wmnet with reason: Maintenance
[13:33:45] <stashbot>	 T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130
[13:33:55] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1128.eqiad.wmnet with reason: Maintenance
[13:34:01] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1128 (T321130)', diff saved to https://phabricator.wikimedia.org/P40586 and previous config saved to /var/cache/conftool/dbconfig/20221122-133401-marostegui.json
[13:34:44] <wikibugs>	 (CR) David Caro: [C: +2] dumps: fix http alert to check the new status [puppet] - https://gerrit.wikimedia.org/r/859498 (https://phabricator.wikimedia.org/T238720) (owner: David Caro)
[13:34:59] <jinxer-wm>	 (KubernetesAPILatency) firing: (13) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:37:05] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P40587 and previous config saved to /var/cache/conftool/dbconfig/20221122-133705-ladsgroup.json
[13:37:43] <icinga-wm>	 RECOVERY - Host ganeti1012.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.81 ms
[13:38:25] <wikibugs>	 (PS1) Elukey: team-sre: add druid alerts for webrequest_sampled_live [alerts] - https://gerrit.wikimedia.org/r/859502 (https://phabricator.wikimedia.org/T314981)
[13:40:39] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111', diff saved to https://phabricator.wikimedia.org/P40588 and previous config saved to /var/cache/conftool/dbconfig/20221122-134038-marostegui.json
[13:41:18] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.wikireplicas.add-wiki
[13:41:32] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317', diff saved to https://phabricator.wikimedia.org/P40589 and previous config saved to /var/cache/conftool/dbconfig/20221122-134131-ladsgroup.json
[13:42:03] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.dns.netbox
[13:43:24] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:45:06] <wikibugs>	 (CR) Filippo Giunchedi: "LGTM! nit inline, looking good otherwise" [alerts] - https://gerrit.wikimedia.org/r/859502 (https://phabricator.wikimedia.org/T314981) (owner: Elukey)
[13:46:43] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128 (T321130)', diff saved to https://phabricator.wikimedia.org/P40590 and previous config saved to /var/cache/conftool/dbconfig/20221122-134643-marostegui.json
[13:46:49] <stashbot>	 T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130
[13:48:54] <wikibugs>	 (PS2) Elukey: team-sre: add druid alerts for webrequest_sampled_live [alerts] - https://gerrit.wikimedia.org/r/859502 (https://phabricator.wikimedia.org/T314981)
[13:49:07] <wikibugs>	 (CR) Elukey: team-sre: add druid alerts for webrequest_sampled_live (1 comment) [alerts] - https://gerrit.wikimedia.org/r/859502 (https://phabricator.wikimedia.org/T314981) (owner: Elukey)
[13:52:12] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T322618)', diff saved to https://phabricator.wikimedia.org/P40591 and previous config saved to /var/cache/conftool/dbconfig/20221122-135211-ladsgroup.json
[13:52:13] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1191.eqiad.wmnet with reason: Maintenance
[13:52:17] <stashbot>	 T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618
[13:52:26] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1191.eqiad.wmnet with reason: Maintenance
[13:52:33] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1191 (T322618)', diff saved to https://phabricator.wikimedia.org/P40592 and previous config saved to /var/cache/conftool/dbconfig/20221122-135233-ladsgroup.json
[13:54:34] <wikibugs>	 (PS2) Cathal Mooney: Release v0.6.1 update [software/homer/deploy] - https://gerrit.wikimedia.org/r/859491 (https://phabricator.wikimedia.org/T312635)
[13:54:43] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T322618)', diff saved to https://phabricator.wikimedia.org/P40593 and previous config saved to /var/cache/conftool/dbconfig/20221122-135442-ladsgroup.json
[13:55:45] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111 (T321126)', diff saved to https://phabricator.wikimedia.org/P40594 and previous config saved to /var/cache/conftool/dbconfig/20221122-135545-marostegui.json
[13:55:47] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2123.codfw.wmnet with reason: Maintenance
[13:55:49] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2123.codfw.wmnet with reason: Maintenance
[13:55:50] <stashbot>	 T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126
[13:55:56] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2123 (T321126)', diff saved to https://phabricator.wikimedia.org/P40595 and previous config saved to /var/cache/conftool/dbconfig/20221122-135556-marostegui.json
[13:56:38] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317 (T322618)', diff saved to https://phabricator.wikimedia.org/P40596 and previous config saved to /var/cache/conftool/dbconfig/20221122-135638-ladsgroup.json
[13:56:40] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2169.codfw.wmnet with reason: Maintenance
[13:56:41] <wikibugs>	 (CR) Vgutierrez: [V: +1 C: +2] icinga: Reject non-tls requests with a 403 [puppet] - https://gerrit.wikimedia.org/r/859457 (https://phabricator.wikimedia.org/T238720) (owner: Vgutierrez)
[13:56:53] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2169.codfw.wmnet with reason: Maintenance
[13:57:00] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2169:3317 (T322618)', diff saved to https://phabricator.wikimedia.org/P40597 and previous config saved to /var/cache/conftool/dbconfig/20221122-135659-ladsgroup.json
[13:57:17] <wikibugs>	 (PS4) Jbond: C:swift::storage: add variable for data directory [puppet] - https://gerrit.wikimedia.org/r/848418 (https://phabricator.wikimedia.org/T308677)
[13:57:46] <vgutierrez>	 !log block plain text requests on icinga.wm.o - T238720
[13:57:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:57:51] <stashbot>	 T238720: Deprecate and disable port 80 for one-off sites under canonical domains - https://phabricator.wikimedia.org/T238720
[13:58:10] <wikibugs>	 (PS3) Jbond: C:swift: add swift disks fact [puppet] - https://gerrit.wikimedia.org/r/848451
[13:58:18] <wikibugs>	 (CR) CI reject: [V: -1] C:swift::storage: add variable for data directory [puppet] - https://gerrit.wikimedia.org/r/848418 (https://phabricator.wikimedia.org/T308677) (owner: Jbond)
[13:58:44] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38379/console" [puppet] - https://gerrit.wikimedia.org/r/848418 (https://phabricator.wikimedia.org/T308677) (owner: Jbond)
[13:59:07] <wikibugs>	 (CR) CI reject: [V: -1] C:swift: add swift disks fact [puppet] - https://gerrit.wikimedia.org/r/848451 (owner: Jbond)
[13:59:17] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317 (T322618)', diff saved to https://phabricator.wikimedia.org/P40598 and previous config saved to /var/cache/conftool/dbconfig/20221122-135917-ladsgroup.json
[13:59:22] <stashbot>	 T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618
[13:59:27] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123 (T321126)', diff saved to https://phabricator.wikimedia.org/P40599 and previous config saved to /var/cache/conftool/dbconfig/20221122-135926-marostegui.json
[13:59:54] <wikibugs>	 SRE, Phabricator, Traffic, Wikimedia-Incident: Phabricator was logging out users repeatedly (2022-08-26) - https://phabricator.wikimedia.org/T316337 (Aklapper) >>! In T316337#8216814, @jcrespo wrote: > I am waiting for a 1 paragraph from @Vgutierrez to understand what actually happened to varnish...
[13:59:59] <jinxer-wm>	 (KubernetesAPILatency) firing: (10) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[14:00:05] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, and awight: Time to snap out of that daydream and deploy UTC afternoon backport window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221122T1400).
[14:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[14:00:05] <jouncebot>	 Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221122T1400)
[14:00:11] <wikibugs>	 (PS5) Jbond: C:swift::storage: add variable for data directory [puppet] - https://gerrit.wikimedia.org/r/848418 (https://phabricator.wikimedia.org/T308677)
[14:00:42] <wikibugs>	 (PS6) Jbond: C:swift::storage: add variable for data directory [puppet] - https://gerrit.wikimedia.org/r/848418 (https://phabricator.wikimedia.org/T308677)
[14:00:58] <wikibugs>	 (PS4) Jbond: C:swift: add swift disks fact [puppet] - https://gerrit.wikimedia.org/r/848451
[14:01:50] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128', diff saved to https://phabricator.wikimedia.org/P40600 and previous config saved to /var/cache/conftool/dbconfig/20221122-140150-marostegui.json
[14:01:59] <wikibugs>	 (CR) CI reject: [V: -1] C:swift: add swift disks fact [puppet] - https://gerrit.wikimedia.org/r/848451 (owner: Jbond)
[14:03:47] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "Ship it!" [alerts] - https://gerrit.wikimedia.org/r/859502 (https://phabricator.wikimedia.org/T314981) (owner: Elukey)
[14:05:28] <wikibugs>	 (CR) Jbond: C:swift::storage: add variable for data directory (1 comment) [puppet] - https://gerrit.wikimedia.org/r/848418 (https://phabricator.wikimedia.org/T308677) (owner: Jbond)
[14:06:24] <logmsgbot>	 !log marostegui@cumin1001 Added views for new wiki: bnwikiquote T319190
[14:06:24] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.wikireplicas.add-wiki (exit_code=0)
[14:06:29] <stashbot>	 T319190: Prepare and check storage layer for bnwikiquote - https://phabricator.wikimedia.org/T319190
[14:06:34] <wikibugs>	 SRE, Phabricator, Traffic, Wikimedia-Incident: Phabricator was logging out users repeatedly (2022-08-26) - https://phabricator.wikimedia.org/T316337 (Vgutierrez) > I am going to do it, but I am waiting for a 1 paragraph from @Vgutierrez to understand what actually happened to varnish (not just th...
[14:08:43] <wikibugs>	 (PS5) Jbond: C:swift: add swift disks fact [puppet] - https://gerrit.wikimedia.org/r/848451 (https://phabricator.wikimedia.org/T308677)
[14:08:45] <wikibugs>	 (PS5) Jbond: P:swift::storage: add new resource to format via pci path [puppet] - https://gerrit.wikimedia.org/r/848419 (https://phabricator.wikimedia.org/T308677)
[14:08:47] <wikibugs>	 (PS5) Jbond: ms-be2050: enable disks by path configuerations [puppet] - https://gerrit.wikimedia.org/r/848420 (https://phabricator.wikimedia.org/T308677)
[14:09:49] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P40601 and previous config saved to /var/cache/conftool/dbconfig/20221122-140949-ladsgroup.json
[14:11:53] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on aux-k8s-etcd1003.eqiad.wmnet with reason: ganeti reboot
[14:12:09] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on aux-k8s-etcd1003.eqiad.wmnet with reason: ganeti reboot
[14:12:19] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on dse-k8s-etcd1001.eqiad.wmnet with reason: ganeti reboot
[14:12:34] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on dse-k8s-etcd1001.eqiad.wmnet with reason: ganeti reboot
[14:12:48] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on kubestagetcd1004.eqiad.wmnet with reason: ganeti reboot
[14:13:03] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on kubestagetcd1004.eqiad.wmnet with reason: ganeti reboot
[14:13:05] <wikibugs>	 SRE, Phabricator, Traffic, Wikimedia-Incident: Phabricator was logging out users repeatedly (2022-08-26) - https://phabricator.wikimedia.org/T316337 (jcrespo) Thanks, that is all I needed to understand the context! I will create a draft doc on Wikitech and link it here for review.
[14:13:54] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1032.eqiad.wmnet
[14:14:24] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317', diff saved to https://phabricator.wikimedia.org/P40602 and previous config saved to /var/cache/conftool/dbconfig/20221122-141423-ladsgroup.json
[14:14:33] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123', diff saved to https://phabricator.wikimedia.org/P40603 and previous config saved to /var/cache/conftool/dbconfig/20221122-141433-marostegui.json
[14:14:47] <icinga-wm>	 RECOVERY - Host ganeti1012 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms
[14:15:08] <wikibugs>	 SRE, Traffic, Patch-For-Review: Deprecate and disable port 80 for one-off sites under canonical domains - https://phabricator.wikimedia.org/T238720 (Vgutierrez)
[14:16:01] <icinga-wm>	 PROBLEM - Check systemd state on ganeti1012 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service,nic-saturation-exporter.service,prometheus-ganeti-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:16:57] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128', diff saved to https://phabricator.wikimedia.org/P40604 and previous config saved to /var/cache/conftool/dbconfig/20221122-141656-marostegui.json
[14:17:45] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:18:12] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host dbprov1004.mgmt.eqiad.wmnet with reboot policy FORCED
[14:19:45] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] api-gateway: convert to modules [deployment-charts] - https://gerrit.wikimedia.org/r/856950 (owner: Giuseppe Lavagetto)
[14:19:57] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:20:33] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1032.eqiad.wmnet
[14:22:39] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for cloudvirt2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[14:23:23] <_joe_>	 jouncebot: next
[14:23:23] <jouncebot>	 In 2 hour(s) and 36 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221122T1700)
[14:23:55] <jinxer-wm>	 (NodeTextfileStale) firing: (48) Stale textfile for cp1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[14:24:28] <wikibugs>	 (03Merged) 10jenkins-bot: api-gateway: convert to modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/856950 (owner: 10Giuseppe Lavagetto)
[14:24:54] <jinxer-wm>	 (NodeTextfileStale) firing: (48) Stale textfile for cp1076:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[14:24:56] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P40605 and previous config saved to /var/cache/conftool/dbconfig/20221122-142455-ladsgroup.json
[14:24:59] <jinxer-wm>	 (KubernetesAPILatency) firing: (10) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[14:26:26] <vgutierrez>	 Emperor: are the swift_ring_manager errors on thanos-fe1001 expected?
[14:28:05] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.wikireplicas.add-wiki
[14:28:11] <vgutierrez>	 Emperor: hmm... Nov 22 14:10:33 thanos-fe1001 swift_ring_manager[3724760]: urllib.error.HTTPError: HTTP Error 401: Unauthorized <-- issued by /usr/bin/swift-dispersion-report
[14:29:27] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA: eqiad: move non WMCS servers out of rack C8 - https://phabricator.wikimedia.org/T308339 (10MoritzMuehlenhoff)
[14:29:30] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317', diff saved to https://phabricator.wikimedia.org/P40606 and previous config saved to /var/cache/conftool/dbconfig/20221122-142930-ladsgroup.json
[14:29:33] <vgutierrez>	 Emperor: so I'm guessing https://thanos-swift.discovery.wmnet/auth/v1.0 is responsible for that 401
[14:29:40] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123', diff saved to https://phabricator.wikimedia.org/P40607 and previous config saved to /var/cache/conftool/dbconfig/20221122-142939-marostegui.json
[14:32:03] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128 (T321130)', diff saved to https://phabricator.wikimedia.org/P40608 and previous config saved to /var/cache/conftool/dbconfig/20221122-143203-marostegui.json
[14:32:05] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1132.eqiad.wmnet with reason: Maintenance
[14:32:09] <stashbot>	 T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130
[14:32:18] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1132.eqiad.wmnet with reason: Maintenance
[14:32:22] <logmsgbot>	 !log oblivian@deploy1002 helmfile [staging] START helmfile.d/services/api-gateway: apply
[14:32:25] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1132 (T321130)', diff saved to https://phabricator.wikimedia.org/P40609 and previous config saved to /var/cache/conftool/dbconfig/20221122-143224-marostegui.json
[14:32:41] <wikibugs>	 (03CR) 10Jbond: ms-be2050: enable disks by path configuerations (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/848420 (https://phabricator.wikimedia.org/T308677) (owner: 10Jbond)
[14:33:03] <icinga-wm>	 RECOVERY - Check systemd state on ganeti1012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:33:10] <logmsgbot>	 !log oblivian@deploy1002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply
[14:33:18] <wikibugs>	 (03PS6) 10Jbond: P:swift::storage: add new resource to format via pci path [puppet] - 10https://gerrit.wikimedia.org/r/848419 (https://phabricator.wikimedia.org/T308677)
[14:33:20] <wikibugs>	 (03PS6) 10Jbond: ms-be2050: enable disks by path configuerations [puppet] - 10https://gerrit.wikimedia.org/r/848420 (https://phabricator.wikimedia.org/T308677)
[14:33:43] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] Introduce the ClusterConfig class [mediawiki-config] - 10https://gerrit.wikimedia.org/r/749717 (owner: 10Giuseppe Lavagetto)
[14:33:49] <wikibugs>	 (03PS6) 10Jbond: C:swift: add swift disks fact [puppet] - 10https://gerrit.wikimedia.org/r/848451 (https://phabricator.wikimedia.org/T308677)
[14:33:59] <wikibugs>	 (03PS7) 10Jbond: P:swift::storage: add new resource to format via pci path [puppet] - 10https://gerrit.wikimedia.org/r/848419 (https://phabricator.wikimedia.org/T308677)
[14:34:07] <wikibugs>	 (03PS7) 10Jbond: ms-be2050: enable disks by path configuerations [puppet] - 10https://gerrit.wikimedia.org/r/848420 (https://phabricator.wikimedia.org/T308677)
[14:34:13] <wikibugs>	 (03PS2) 10Ssingh: lvs4009: commission new LVS host (ulsfo hardware refresh) [puppet] - 10https://gerrit.wikimedia.org/r/858336 (https://phabricator.wikimedia.org/T317247)
[14:34:33] <logmsgbot>	 !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/api-gateway: apply
[14:34:58] <wikibugs>	 10SRE, 10Phabricator, 10Traffic, 10Wikimedia-Incident: Phabricator was logging out users repeatedly (2022-08-26) - https://phabricator.wikimedia.org/T316337 (10jcrespo) I am filling in: https://wikitech.wikimedia.org/wiki/Incidents/2022-08-26_Phabricator_login_issues (Still WIP)
[14:35:20] <wikibugs>	 (03Merged) 10jenkins-bot: Introduce the ClusterConfig class [mediawiki-config] - 10https://gerrit.wikimedia.org/r/749717 (owner: 10Giuseppe Lavagetto)
[14:35:22] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38380/console" [puppet] - 10https://gerrit.wikimedia.org/r/848419 (https://phabricator.wikimedia.org/T308677) (owner: 10Jbond)
[14:35:37] <logmsgbot>	 !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/api-gateway: apply
[14:36:00] <wikibugs>	 (03PS4) 10Vgutierrez: node: Exclude trafficserver promfile mtime check [alerts] - 10https://gerrit.wikimedia.org/r/858658 (https://phabricator.wikimedia.org/T292815) (owner: 10BCornwall)
[14:36:26] <vgutierrez>	 godog: ^^ merging https://gerrit.wikimedia.org/r/858658  is enough to get it deployed?
[14:37:05] <godog>	 vgutierrez: correct yeah, will be deployed at the next puppet run
[14:37:14] <vgutierrez>	 ack
[14:37:22] <vgutierrez>	 merging it.. we got some noise already
[14:37:54] <vgutierrez>	 (as soon as jenkins-bot is happy with the current PS)
[14:38:17] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1122.eqiad.wmnet with reason: Maintenance
[14:38:20] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1122.eqiad.wmnet with reason: Maintenance
[14:38:47] <godog>	 *nod*
[14:38:54] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] C:swift::storage: add variable for data directory [puppet] - 10https://gerrit.wikimedia.org/r/848418 (https://phabricator.wikimedia.org/T308677) (owner: 10Jbond)
[14:38:57] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] C:swift: add swift disks fact [puppet] - 10https://gerrit.wikimedia.org/r/848451 (https://phabricator.wikimedia.org/T308677) (owner: 10Jbond)
[14:39:00] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1122.eqiad.wmnet with reason: Maintenance
[14:39:01] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] lvs4009: commission new LVS host (ulsfo hardware refresh) [puppet] - 10https://gerrit.wikimedia.org/r/858336 (https://phabricator.wikimedia.org/T317247) (owner: 10Ssingh)
[14:39:02] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1122.eqiad.wmnet with reason: Maintenance
[14:39:04] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] P:swift::storage: add new resource to format via pci path [puppet] - 10https://gerrit.wikimedia.org/r/848419 (https://phabricator.wikimedia.org/T308677) (owner: 10Jbond)
[14:39:23] <sukhe>	 jbond: ok to merge your changes? :)
[14:39:27] <logmsgbot>	 !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/api-gateway: apply
[14:39:43] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1122.eqiad.wmnet with reason: Maintenance
[14:39:45] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1122.eqiad.wmnet with reason: Maintenance
[14:40:02] <jbond>	 sukhe: yes please
[14:40:02] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T322618)', diff saved to https://phabricator.wikimedia.org/P40610 and previous config saved to /var/cache/conftool/dbconfig/20221122-144002-ladsgroup.json
[14:40:04] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1194.eqiad.wmnet with reason: Maintenance
[14:40:05] <sukhe>	 done!
[14:40:08] <jbond>	 thanks
[14:40:08] <stashbot>	 T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618
[14:40:17] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1194.eqiad.wmnet with reason: Maintenance
[14:40:24] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1194 (T322618)', diff saved to https://phabricator.wikimedia.org/P40611 and previous config saved to /var/cache/conftool/dbconfig/20221122-144023-ladsgroup.json
[14:40:28] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1122.eqiad.wmnet with reason: Maintenance
[14:40:31] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1122.eqiad.wmnet with reason: Maintenance
[14:40:42] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] node: Exclude trafficserver promfile mtime check [alerts] - 10https://gerrit.wikimedia.org/r/858658 (https://phabricator.wikimedia.org/T292815) (owner: 10BCornwall)
[14:41:13] <vgutierrez>	 wonderful
[14:41:18] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1122.eqiad.wmnet with reason: Maintenance
[14:41:21] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1122.eqiad.wmnet with reason: Maintenance
[14:41:29] <logmsgbot>	 !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: apply
[14:41:51] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host lvs4009.ulsfo.wmnet with OS buster
[14:42:11] <vgutierrez>	 node.yaml: 5:15: group "node_exporter", rule 1, "NodeTextfileStale": could not parse expression: 1:44: parse error: unknown escape sequence U+002E '.'
[14:42:15] <vgutierrez>	 sigh
[14:42:30] <wikibugs>	 (03CR) 10Cathal Mooney: [V: 03+2 C: 03+2] Release v0.6.1 update [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/859491 (https://phabricator.wikimedia.org/T312635) (owner: 10Cathal Mooney)
[14:42:33] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T322618)', diff saved to https://phabricator.wikimedia.org/P40612 and previous config saved to /var/cache/conftool/dbconfig/20221122-144232-ladsgroup.json
[14:43:39] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: apply config changes - bking@cumin2002 - T319020
[14:43:44] <stashbot>	 T319020: Reset to upstream java GC options and remove redundant JVM options - https://phabricator.wikimedia.org/T319020
[14:43:46] <wikibugs>	 (03PS5) 10Vgutierrez: node: Exclude trafficserver promfile mtime check [alerts] - 10https://gerrit.wikimedia.org/r/858658 (https://phabricator.wikimedia.org/T292815) (owner: 10BCornwall)
[14:44:37] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317 (T322618)', diff saved to https://phabricator.wikimedia.org/P40613 and previous config saved to /var/cache/conftool/dbconfig/20221122-144436-ladsgroup.json
[14:44:38] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2182.codfw.wmnet with reason: Maintenance
[14:44:47] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123 (T321126)', diff saved to https://phabricator.wikimedia.org/P40614 and previous config saved to /var/cache/conftool/dbconfig/20221122-144446-marostegui.json
[14:44:49] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2128.codfw.wmnet with reason: Maintenance
[14:44:51] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2128.codfw.wmnet with reason: Maintenance
[14:44:52] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2182.codfw.wmnet with reason: Maintenance
[14:44:52] <stashbot>	 T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126
[14:44:53] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db2094.codfw.wmnet with reason: Maintenance
[14:44:55] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db2094.codfw.wmnet with reason: Maintenance
[14:44:58] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2182 (T322618)', diff saved to https://phabricator.wikimedia.org/P40615 and previous config saved to /var/cache/conftool/dbconfig/20221122-144458-ladsgroup.json
[14:44:59] <jinxer-wm>	 (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[14:45:08] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2128 (T321126)', diff saved to https://phabricator.wikimedia.org/P40616 and previous config saved to /var/cache/conftool/dbconfig/20221122-144507-marostegui.json
[14:45:17] <stashbot>	 T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618
[14:45:19] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132 (T321130)', diff saved to https://phabricator.wikimedia.org/P40617 and previous config saved to /var/cache/conftool/dbconfig/20221122-144519-marostegui.json
[14:45:20] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dbprov1004.mgmt.eqiad.wmnet with reboot policy FORCED
[14:45:27] <stashbot>	 T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130
[14:45:29] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] node: Exclude trafficserver promfile mtime check [alerts] - 10https://gerrit.wikimedia.org/r/858658 (https://phabricator.wikimedia.org/T292815) (owner: 10BCornwall)
[14:45:36] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.deploy.python-code homer to cumin2002.codfw.wmnet,cumin1001.eqiad.wmnet with reason: Release v0.6.1 - cmooney@cumin1001
[14:45:40] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host dbprov1004.mgmt.eqiad.wmnet with reboot policy FORCED
[14:47:14] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) homer to cumin2002.codfw.wmnet,cumin1001.eqiad.wmnet with reason: Release v0.6.1 - cmooney@cumin1001
[14:47:16] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T322618)', diff saved to https://phabricator.wikimedia.org/P40618 and previous config saved to /var/cache/conftool/dbconfig/20221122-144715-ladsgroup.json
[14:48:13] <icinga-wm>	 PROBLEM - Check systemd state on dbstore1007 is CRITICAL: CRITICAL - degraded: The following units failed: user@0.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:48:33] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128 (T321126)', diff saved to https://phabricator.wikimedia.org/P40619 and previous config saved to /var/cache/conftool/dbconfig/20221122-144833-marostegui.json
[14:48:37] <Emperor>	 vgutierrez: the occasional failure isn't the end of the world (it runs hourly); those auth failures are related to the frontends being loaded; I'm starting to wonder if we should think about more capacity there as well as ms-
[14:48:59] <logmsgbot>	 !log jnuche@deploy1002 Started scap: testing k8s deploys
[14:51:38] <vgutierrez>	 Emperor: ack
[14:53:24] <logmsgbot>	 !log btullis@cumin1001 Added views for new wiki: tlwikiquote T317111
[14:53:24] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.wikireplicas.add-wiki (exit_code=0)
[14:53:30] <stashbot>	 T317111: Prepare and check storage layer for tlwikiquote - https://phabricator.wikimedia.org/T317111
[14:53:33] <icinga-wm>	 RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:54:11] <wikibugs>	 (03CR) 10Ilias Sarantopoulos: ml-services: Update docker images to use single model server (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/859461 (https://phabricator.wikimedia.org/T320374) (owner: 10Ilias Sarantopoulos)
[14:54:59] <jinxer-wm>	 (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[14:55:06] <wikibugs>	 (03PS2) 10Ilias Sarantopoulos: ml-services: Update docker images to use single model server [deployment-charts] - 10https://gerrit.wikimedia.org/r/859461 (https://phabricator.wikimedia.org/T320374)
[14:55:07] <logmsgbot>	 !log jnuche@deploy1002 Finished scap: testing k8s deploys (duration: 06m 08s)
[14:55:22] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1122.eqiad.wmnet with reason: Maintenance
[14:55:25] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1122.eqiad.wmnet with reason: Maintenance
[14:55:49] <icinga-wm>	 PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[14:56:41] <logmsgbot>	 !log oblivian@deploy1002 Started scap: Adding clusterconfig
[14:57:39] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P40620 and previous config saved to /var/cache/conftool/dbconfig/20221122-145738-ladsgroup.json
[14:57:47] <icinga-wm>	 RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[14:58:34] <vgutierrez>	 godog: node.yaml: 5:15: group "node_exporter", rule 1, "NodeTextfileStale": could not parse expression: 1:44: parse error: unknown escape sequence U+002E '.' --> any idea on how to properly escape a dot (.) in a regex on the alerts repo?
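The parse error quoted above comes from how PromQL reads string literals: inside double (or single) quotes a backslash starts an escape sequence, and `\.` is not a valid one. A literal dot in a regex matcher therefore has to be written as `\\.`, as the character class `[.]`, or inside a backtick-quoted string, where backslashes are not interpreted. A minimal Python sketch of that escaping follows; the `file` label and the `trafficserver.prom` file name are assumptions for illustration, since the actual expression in the change is not shown in the log.

```python
import re


def promql_regex_literal(pattern: str) -> str:
    """Render a regex as a double-quoted PromQL string literal."""
    # Double every backslash first, then escape embedded double quotes,
    # so the PromQL parser hands the original regex to the regex engine.
    escaped = pattern.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


# Hypothetical file name; re.escape turns the dot into a literal-dot regex.
pattern = re.escape("trafficserver.prom")

# Hypothetical label matcher built from the escaped pattern.
matcher = f"file!~{promql_regex_literal(pattern)}"
print(matcher)
```

If the expression sits inside a double-quoted YAML scalar, YAML processes backslashes as well, so a plain or block scalar is likely the simpler place to put it.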
[15:00:25] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132', diff saved to https://phabricator.wikimedia.org/P40621 and previous config saved to /var/cache/conftool/dbconfig/20221122-150025-marostegui.json
[15:00:58] <logmsgbot>	 !log oblivian@deploy1002 Finished scap: Adding clusterconfig (duration: 04m 17s)
[15:02:22] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P40622 and previous config saved to /var/cache/conftool/dbconfig/20221122-150221-ladsgroup.json
[15:03:12] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs4009.ulsfo.wmnet with reason: host reimage
[15:03:40] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128', diff saved to https://phabricator.wikimedia.org/P40623 and previous config saved to /var/cache/conftool/dbconfig/20221122-150339-marostegui.json
[15:06:32] <icinga-wm>	 PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[15:06:50] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs4009.ulsfo.wmnet with reason: host reimage
[15:07:28] <icinga-wm>	 RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[15:07:32] <icinga-wm>	 RECOVERY - SSH on db1122.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:09:59] <jinxer-wm>	 (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:11:21] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dbprov1004.mgmt.eqiad.wmnet with reboot policy FORCED
[15:11:39] <wikibugs>	 (03PS1) 10Jforrester: [Beta Cluster] Point Wikifunctions mobile links to the right place [mediawiki-config] - 10https://gerrit.wikimedia.org/r/859539 (https://phabricator.wikimedia.org/T314891)
[15:12:26] <James_F>	 jouncebot: now
[15:12:26] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 47 minute(s)
[15:12:34] <James_F>	 'K, will sling out a Beta-only patch.
[15:12:39] <jinxer-wm>	 (NodeTextfileStale) resolved: Stale textfile for cloudvirt2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[15:12:45] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P40624 and previous config saved to /var/cache/conftool/dbconfig/20221122-151245-ladsgroup.json
[15:13:03] <wikibugs>	 (03CR) 10Jforrester: [C: 03+2] [Beta Cluster] Point Wikifunctions mobile links to the right place [mediawiki-config] - 10https://gerrit.wikimedia.org/r/859539 (https://phabricator.wikimedia.org/T314891) (owner: 10Jforrester)
[15:13:32] <logmsgbot>	 !log pt1979@cumin1001 START - Cookbook sre.hosts.provision for host dbprov1004.mgmt.eqiad.wmnet with reboot policy FORCED
[15:13:56] <wikibugs>	 (03Merged) 10jenkins-bot: [Beta Cluster] Point Wikifunctions mobile links to the right place [mediawiki-config] - 10https://gerrit.wikimedia.org/r/859539 (https://phabricator.wikimedia.org/T314891) (owner: 10Jforrester)
[15:14:00] <icinga-wm>	 PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[15:14:42] <icinga-wm>	 PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:15:04] <icinga-wm>	 RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[15:15:07] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: image-suggestion: convert to modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/859485
[15:15:09] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: Fixes for conversions [deployment-charts] - 10https://gerrit.wikimedia.org/r/859540
[15:15:11] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: remove unused chart/project image-suggestion-api [deployment-charts] - 10https://gerrit.wikimedia.org/r/859541
[15:15:32] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132', diff saved to https://phabricator.wikimedia.org/P40625 and previous config saved to /var/cache/conftool/dbconfig/20221122-151532-marostegui.json
[15:15:39] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] Fixes for conversions [deployment-charts] - 10https://gerrit.wikimedia.org/r/859540 (owner: 10Giuseppe Lavagetto)
[15:15:48] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs1012.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1006.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1007.eqiad.wmnet, wdqs1005.eqiad.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs1013.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1006.eqiad.wmnet, wdqs1004.eqiad.wmnet, wdqs10
[15:15:48] <icinga-wm>	 .wmnet, wdqs1016.eqiad.wmnet, wdqs1007.eqiad.wmnet, wdqs1005.eqiad.wmnet are marked down but pooled: wdqs_80: Servers wdqs1007.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1006.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1005.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[15:16:04] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] image-suggestion: convert to modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/859485 (owner: 10Giuseppe Lavagetto)
[15:16:08] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] remove unused chart/project image-suggestion-api [deployment-charts] - 10https://gerrit.wikimedia.org/r/859541 (owner: 10Giuseppe Lavagetto)
[15:16:18] <jinxer-wm>	 (ProbeDown) firing: Service wdqs-ssl:443 has failed probes (http_wdqs-ssl_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs-ssl:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:16:21] <gehel>	 ryankemper / inflatador ^^
[15:16:28] <inflatador>	 gehel :eyes
[15:16:30] <icinga-wm>	 PROBLEM - WDQS SPARQL on wdqs1016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[15:16:30] <icinga-wm>	 PROBLEM - WDQS SPARQL on wdqs1015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[15:16:52] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs1015.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1004.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1007.eqiad.wmnet, wdqs1005.eqiad.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs1015.eqiad.wmnet, wdqs1005.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1007.eqiad.wmnet, wdqs10
[15:16:52] <icinga-wm>	 .wmnet are marked down but pooled: wdqs_80: Servers wdqs1004.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1006.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1007.eqiad.wmnet, wdqs1005.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[15:16:57] <gehel>	 inflatador: is there any ongoing work on wdqs / eqiad?
[15:17:04] <inflatador>	 No
[15:17:29] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P40626 and previous config saved to /var/cache/conftool/dbconfig/20221122-151728-ladsgroup.json
[15:17:44] <icinga-wm>	 PROBLEM - WDQS SPARQL on wdqs1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[15:17:45] <dcausse>	 surge in load, thread counts exploded
[15:18:00] <wikibugs>	 (03CR) 10Stang: "To deployer: this patch requires a maint script run, please read T323378#8413476" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/858705 (https://phabricator.wikimedia.org/T323378) (owner: 10Stang)
[15:18:08] <wikibugs>	 10SRE, 10Machine-Learning-Team, 10Service-deployment-requests, 10artificial-intelligence: New Service Request 'open_nsfw' - https://phabricator.wikimedia.org/T250110 (10calbon) a:03calbon
[15:18:18] <jinxer-wm>	 (ProbeDown) firing: Service wdqs-ssl:443 has failed probes (http_wdqs-ssl_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#wdqs-ssl:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:18:35] * akosiaris around
[15:18:37] <akosiaris>	 acking page
[15:18:46] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128', diff saved to https://phabricator.wikimedia.org/P40627 and previous config saved to /var/cache/conftool/dbconfig/20221122-151846-marostegui.json
[15:18:48] <jelto>	 also around, thanks alex
[15:18:51] <inflatador>	 akosiaris is that page for wdqs?
[15:18:56] <icinga-wm>	 PROBLEM - WDQS SPARQL on wdqs1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 429 Too Many Requests - string http://www.w3.org/2001/XML... not found on https://query.wikidata.org:443/bigdata/namespace/wdq/sparql?query=SELECT%20*%20WHERE%20%7Bwikibase%3ADump%20schema%3AdateModified%20%3Fy%7D%20LIMIT%201 - 754 bytes in 1.049 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[15:18:57] <akosiaris>	 inflatador: yes
[15:19:03] <gehel>	 so probably related to specific queries? a bot abusing the service?
[15:19:14] <inflatador>	 akosiaris we are looking into it now if that helps
[15:19:19] <herron>	 around as well
[15:19:28] <akosiaris>	 inflatador: cool, thanks for letting us know
[15:19:47] <akosiaris>	 gehel: I see a 429 Too Many Requests alert, so you are probably right
[15:19:51] <dcausse>	 yes I suspect a bot with a bad query, let's try to restart all the nodes in eqiad
[15:19:59] <jinxer-wm>	 (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:20:00] <wikibugs>	 (03Merged) 10jenkins-bot: Fixes for conversions [deployment-charts] - 10https://gerrit.wikimedia.org/r/859540 (owner: 10Giuseppe Lavagetto)
[15:20:12] <inflatador>	 dcausse cool, will get started on that immediately
[15:20:16] <gehel>	 WDQS is known to be somewhat unstable, and we're not shooting for a 99% availability. So no need to have everyone on deck at the moment
[15:20:18] <icinga-wm>	 RECOVERY - WDQS SPARQL on wdqs1012 is OK: HTTP OK: HTTP/1.1 200 OK - 688 bytes in 1.051 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[15:20:20] <icinga-wm>	 PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[15:20:20] <icinga-wm>	 RECOVERY - WDQS SPARQL on wdqs1015 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.121 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[15:20:37] <akosiaris>	 gehel: cool, thanks for that info
[15:21:05] <gehel>	 inflatador: I'm assuming that you're on it and you'll scream for help as needed?
[15:21:18] <jinxer-wm>	 (ProbeDown) resolved: Service wdqs-ssl:443 has failed probes (http_wdqs-ssl_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs-ssl:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:21:26] <akosiaris>	 do you need an IC?
[15:21:42] <icinga-wm>	 RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[15:21:43] <akosiaris>	 or does it seem simple enough with no need for more coordination?
[15:22:15] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.wdqs.restart
[15:22:21] <wikibugs>	 (PS1) Jbond: wmcs - sso: update ogin url to idp-dev [puppet] - https://gerrit.wikimedia.org/r/859542
[15:22:37] <gehel>	 akosiaris: Let's see if it recovers after a restart of the services. If that's not the case, it's going to be more problematic and might need an IC.
[15:22:48] <inflatador>	 akosiaris ^^ what gehel said
[15:23:12] <icinga-wm>	 RECOVERY - WDQS SPARQL on wdqs1016 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.073 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[15:23:14] <akosiaris>	 cool
[15:23:18] <jinxer-wm>	 (ProbeDown) resolved: Service wdqs-ssl:443 has failed probes (http_wdqs-ssl_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#wdqs-ssl:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:23:39] <jelto>	 ok, I'll stand by and watch what happens after restart. The page recovered fyi
[15:23:54] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[15:24:03] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[15:24:44] <icinga-wm>	 RECOVERY - WDQS SPARQL on wdqs1005 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.079 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[15:25:03] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs4009.ulsfo.wmnet with OS buster
[15:26:03] <wikibugs>	 (PS1) JMeybohm: pontoon: Add .crt filename suffix to PKI root CA [puppet] - https://gerrit.wikimedia.org/r/859543 (https://phabricator.wikimedia.org/T319163)
[15:27:40] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.restart (exit_code=0)
[15:27:52] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T322618)', diff saved to https://phabricator.wikimedia.org/P40628 and previous config saved to /var/cache/conftool/dbconfig/20221122-152751-ladsgroup.json
[15:27:53] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1202.eqiad.wmnet with reason: Maintenance
[15:27:58] <stashbot>	 T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618
[15:28:07] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1202.eqiad.wmnet with reason: Maintenance
[15:28:14] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1202 (T322618)', diff saved to https://phabricator.wikimedia.org/P40629 and previous config saved to /var/cache/conftool/dbconfig/20221122-152813-ladsgroup.json
[15:30:24] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T322618)', diff saved to https://phabricator.wikimedia.org/P40630 and previous config saved to /var/cache/conftool/dbconfig/20221122-153023-ladsgroup.json
[15:30:39] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132 (T321130)', diff saved to https://phabricator.wikimedia.org/P40631 and previous config saved to /var/cache/conftool/dbconfig/20221122-153038-marostegui.json
[15:30:40] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1133.eqiad.wmnet with reason: Maintenance
[15:30:44] <stashbot>	 T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130
[15:30:54] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1133.eqiad.wmnet with reason: Maintenance
[15:31:13] <wikibugs>	 (PS1) Kosta Harlan: GrowthExperiments: Allow accessing NewImpact module in production [mediawiki-config] - https://gerrit.wikimedia.org/r/859545 (https://phabricator.wikimedia.org/T323526)
[15:31:13] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.wdqs.restart
[15:31:27] <wikibugs>	 (CR) Cathal Mooney: [C: +2] Unify routing-intstance config across JunOS devices [homer/public] - https://gerrit.wikimedia.org/r/857598 (https://phabricator.wikimedia.org/T312635) (owner: Cathal Mooney)
[15:32:01] <wikibugs>	 (Merged) jenkins-bot: Unify routing-intstance config across JunOS devices [homer/public] - https://gerrit.wikimedia.org/r/857598 (https://phabricator.wikimedia.org/T312635) (owner: Cathal Mooney)
[15:32:27] <wikibugs>	 (CR) Filippo Giunchedi: "Doh! thank you, LGTM (please see inline too)" [puppet] - https://gerrit.wikimedia.org/r/859543 (https://phabricator.wikimedia.org/T319163) (owner: JMeybohm)
[15:32:35] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T322618)', diff saved to https://phabricator.wikimedia.org/P40632 and previous config saved to /var/cache/conftool/dbconfig/20221122-153235-ladsgroup.json
[15:33:53] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128 (T321126)', diff saved to https://phabricator.wikimedia.org/P40633 and previous config saved to /var/cache/conftool/dbconfig/20221122-153352-marostegui.json
[15:33:54] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2137.codfw.wmnet with reason: Maintenance
[15:33:57] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2137.codfw.wmnet with reason: Maintenance
[15:33:58] <stashbot>	 T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126
[15:34:03] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2137:3315 (T321126)', diff saved to https://phabricator.wikimedia.org/P40634 and previous config saved to /var/cache/conftool/dbconfig/20221122-153403-marostegui.json
[15:34:07] <wikibugs>	 (PS2) JMeybohm: pontoon: Add .crt filename suffix to PKI root CA [puppet] - https://gerrit.wikimedia.org/r/859543 (https://phabricator.wikimedia.org/T319163)
[15:34:42] <wikibugs>	 (CR) JMeybohm: pontoon: Add .crt filename suffix to PKI root CA (1 comment) [puppet] - https://gerrit.wikimedia.org/r/859543 (https://phabricator.wikimedia.org/T319163) (owner: JMeybohm)
[15:34:58] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.restart (exit_code=0)
[15:34:59] <jinxer-wm>	 (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:36:51] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/859543 (https://phabricator.wikimedia.org/T319163) (owner: JMeybohm)
[15:37:28] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315 (T321126)', diff saved to https://phabricator.wikimedia.org/P40635 and previous config saved to /var/cache/conftool/dbconfig/20221122-153727-marostegui.json
[15:37:34] <moritzm>	 !log upgrading mwdebug2002 to 7.4.33-1+0~20221108.73+debian10~1.gbpa00350a+wmf10u1
[15:37:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:37:39] <moritzm>	 !log upgrading mwdebug2002 to PHP 7.4.33-1+0~20221108.73+debian10~1.gbpa00350a+wmf10u1
[15:37:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:37:48] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:38:02] <wikibugs>	 (CR) Jbond: [C: +2] wmcs - sso: update ogin url to idp-dev [puppet] - https://gerrit.wikimedia.org/r/859542 (owner: Jbond)
[15:39:01] <topranks>	 !log updating route-distinguisher for cloud vrf on cloud switches eqiad
[15:39:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:39:57] <wikibugs>	 (PS1) Filippo Giunchedi: prometheus: move webperf jobs to 'ext' instance [puppet] - https://gerrit.wikimedia.org/r/859566 (https://phabricator.wikimedia.org/T175087)
[15:41:08] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1134.eqiad.wmnet with reason: Maintenance
[15:41:21] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1134.eqiad.wmnet with reason: Maintenance
[15:41:22] <wikibugs>	 (PS2) Kosta Harlan: GrowthExperiments: Allow accessing NewImpact module in production [mediawiki-config] - https://gerrit.wikimedia.org/r/859545 (https://phabricator.wikimedia.org/T323526)
[15:41:28] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1134 (T321130)', diff saved to https://phabricator.wikimedia.org/P40636 and previous config saved to /var/cache/conftool/dbconfig/20221122-154127-marostegui.json
[15:41:33] <stashbot>	 T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130
[15:43:22] <moritzm>	 !log importing php7.4 7.4.33-1+0~20221108.73+debian10~1.gbpa00350a+wmf10u1 to apt.wikimedia.org T323358
[15:43:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:45:30] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P40637 and previous config saved to /var/cache/conftool/dbconfig/20221122-154530-ladsgroup.json
[15:45:32] <wikibugs>	 (PS3) Cathal Mooney: Add section for PIC config of QFX5120-48Y port block speeds [homer/public] - https://gerrit.wikimedia.org/r/840105 (https://phabricator.wikimedia.org/T303529)
[15:45:34] <wikibugs>	 (CR) Filippo Giunchedi: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38381/console" [puppet] - https://gerrit.wikimedia.org/r/859566 (https://phabricator.wikimedia.org/T175087) (owner: Filippo Giunchedi)
[15:45:37] <wikibugs>	 (CR) Cathal Mooney: [C: +2] Add section for PIC config of QFX5120-48Y port block speeds [homer/public] - https://gerrit.wikimedia.org/r/840105 (https://phabricator.wikimedia.org/T303529) (owner: Cathal Mooney)
[15:46:09] <wikibugs>	 (Merged) jenkins-bot: Add section for PIC config of QFX5120-48Y port block speeds [homer/public] - https://gerrit.wikimedia.org/r/840105 (https://phabricator.wikimedia.org/T303529) (owner: Cathal Mooney)
[15:49:46] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:50:11] <wikibugs>	 (CR) Cwhite: [C: +1] "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/859566 (https://phabricator.wikimedia.org/T175087) (owner: Filippo Giunchedi)
[15:50:32] <wikibugs>	 (CR) Cwhite: [C: +1] hiera: add graphite2004 to codfw graphite queries [puppet] - https://gerrit.wikimedia.org/r/858611 (https://phabricator.wikimedia.org/T315524) (owner: Filippo Giunchedi)
[15:50:37] <wikibugs>	 (PS6) Vgutierrez: node: Exclude trafficserver promfile mtime check [alerts] - https://gerrit.wikimedia.org/r/858658 (https://phabricator.wikimedia.org/T292815) (owner: BCornwall)
[15:50:43] <wikibugs>	 (PS2) Giuseppe Lavagetto: remove unused chart/project image-suggestion-api [deployment-charts] - https://gerrit.wikimedia.org/r/859541
[15:50:45] <wikibugs>	 (PS3) Giuseppe Lavagetto: image-suggestion: convert to modules [deployment-charts] - https://gerrit.wikimedia.org/r/859485
[15:50:47] <wikibugs>	 (PS1) Giuseppe Lavagetto: Add conversion for ingress [deployment-charts] - https://gerrit.wikimedia.org/r/859567
[15:51:22] <wikibugs>	 (CR) Filippo Giunchedi: [V: +1 C: +2] prometheus: move webperf jobs to 'ext' instance [puppet] - https://gerrit.wikimedia.org/r/859566 (https://phabricator.wikimedia.org/T175087) (owner: Filippo Giunchedi)
[15:51:30] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:51:51] <wikibugs>	 (PS1) Kosta Harlan: GrowthExperiments: Run refreshUserImpactData maintenance script in production [mediawiki-config] - https://gerrit.wikimedia.org/r/859568 (https://phabricator.wikimedia.org/T322541)
[15:52:15] <wikibugs>	 (PS2) Kosta Harlan: GrowthExperiments: Run refreshUserImpactData maintenance script in production [mediawiki-config] - https://gerrit.wikimedia.org/r/859568 (https://phabricator.wikimedia.org/T322541)
[15:52:34] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315', diff saved to https://phabricator.wikimedia.org/P40638 and previous config saved to /var/cache/conftool/dbconfig/20221122-155234-marostegui.json
[15:52:39] <wikibugs>	 (CR) Kosta Harlan: [C: -2] "Don't deploy until Id6eac58bd0ab36c02136486114010739bccc1ba1 is in group2" [mediawiki-config] - https://gerrit.wikimedia.org/r/859568 (https://phabricator.wikimedia.org/T322541) (owner: Kosta Harlan)
[15:54:25] <wikibugs>	 SRE, MW-on-K8s, Traffic, serviceops, Release-Engineering-Team (Seen): Make mw-web and mw-api-ext available behind LVS - https://phabricator.wikimedia.org/T323621 (Clement_Goubert)
[15:55:23] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T321130)', diff saved to https://phabricator.wikimedia.org/P40639 and previous config saved to /var/cache/conftool/dbconfig/20221122-155523-marostegui.json
[15:55:29] <stashbot>	 T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130
[15:55:54] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.dns.netbox
[15:57:12] <claime>	 !log T323621 Add IPs for mw-web.svc and mw-api-ext.svc
[15:57:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:57:18] <stashbot>	 T323621: Make mw-web and mw-api-ext available behind LVS - https://phabricator.wikimedia.org/T323621
[15:58:37] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1122.eqiad.wmnet with reason: Maintenance
[15:58:39] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1122.eqiad.wmnet with reason: Maintenance
[15:58:41] <wikibugs>	 (CR) Cathal Mooney: [C: +1] "LGTM!" [homer/public] - https://gerrit.wikimedia.org/r/859065 (https://phabricator.wikimedia.org/T317247) (owner: Ssingh)
[15:59:07] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:00:37] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P40640 and previous config saved to /var/cache/conftool/dbconfig/20221122-160036-ladsgroup.json
[16:00:45] <wikibugs>	 SRE, MW-on-K8s, Traffic, serviceops, Release-Engineering-Team (Seen): Make mw-web and mw-api-ext available behind LVS - https://phabricator.wikimedia.org/T323621 (Clement_Goubert) Open→In progress
[16:01:01] <wikibugs>	 SRE, MW-on-K8s, Traffic, serviceops, and 3 others: Deploy mediawiki kubernetes services - https://phabricator.wikimedia.org/T321786 (Clement_Goubert)
[16:02:12] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[16:02:18] <moritzm>	 !log drain ganeti1027 for eventual reimage to Bullseye T311687
[16:02:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:02:23] <stashbot>	 T311687: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687
[16:03:53] <wikibugs>	 (CR) Hnowlan: [C: +1] sessionstore: bump container version to v1.0.10 [deployment-charts] - https://gerrit.wikimedia.org/r/857711 (https://phabricator.wikimedia.org/T253244) (owner: Eevans)
[16:04:12] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[16:04:19] <wikibugs>	 (CR) Eevans: [C: +2] sessionstore: bump container version to v1.0.10 [deployment-charts] - https://gerrit.wikimedia.org/r/857711 (https://phabricator.wikimedia.org/T253244) (owner: Eevans)
[16:07:41] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315', diff saved to https://phabricator.wikimedia.org/P40641 and previous config saved to /var/cache/conftool/dbconfig/20221122-160740-marostegui.json
[16:08:38] <wikibugs>	 (Merged) jenkins-bot: sessionstore: bump container version to v1.0.10 [deployment-charts] - https://gerrit.wikimedia.org/r/857711 (https://phabricator.wikimedia.org/T253244) (owner: Eevans)
[16:09:59] <jinxer-wm>	 (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[16:10:05] <wikibugs>	 (PS1) Clément Goubert: wmnet: Add mw-web, mw-api-ext [dns] - https://gerrit.wikimedia.org/r/859571 (https://phabricator.wikimedia.org/T323621)
[16:10:15] <wikibugs>	 (CR) Elukey: Add a spark-operator chart and helmfile configuraiton (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/855674 (https://phabricator.wikimedia.org/T318926) (owner: Btullis)
[16:10:30] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P40642 and previous config saved to /var/cache/conftool/dbconfig/20221122-161029-marostegui.json
[16:10:48] <logmsgbot>	 !log eevans@deploy1002 helmfile [staging] START helmfile.d/services/sessionstore: apply
[16:10:52] <icinga-wm>	 PROBLEM - Ganeti memory on ganeti1011 is CRITICAL: CRIT Memory 95% used. Largest process: qemu-system-x86 (30718) = 25.4% https://wikitech.wikimedia.org/wiki/Ganeti%23Memory_pressure
[16:10:59] <wikibugs>	 (PS2) Bernard Wang: Update TOC to use PinnableHeader [skins/Vector] (wmf/1.40.0-wmf.10) - https://gerrit.wikimedia.org/r/859076 (https://phabricator.wikimedia.org/T317897) (owner: Jdlrobson)
[16:11:21] <logmsgbot>	 !log eevans@deploy1002 helmfile [staging] DONE helmfile.d/services/sessionstore: apply
[16:11:38] <wikibugs>	 (PS3) Bernard Wang: Update TOC to use PinnableHeader [skins/Vector] (wmf/1.40.0-wmf.10) - https://gerrit.wikimedia.org/r/859076 (https://phabricator.wikimedia.org/T317897) (owner: Jdlrobson)
[16:12:51] <wikibugs>	 (PS1) Bernard Wang: Fix icon button spacing in sticky header [skins/Vector] (wmf/1.40.0-wmf.10) - https://gerrit.wikimedia.org/r/859508 (https://phabricator.wikimedia.org/T323176)
[16:15:42] <wikibugs>	 (PS1) Clément Goubert: service::catalog: Add mw-web and mw-api-ext [puppet] - https://gerrit.wikimedia.org/r/859572 (https://phabricator.wikimedia.org/T323621)
[16:15:43] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T322618)', diff saved to https://phabricator.wikimedia.org/P40643 and previous config saved to /var/cache/conftool/dbconfig/20221122-161542-ladsgroup.json
[16:15:44] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[16:15:47] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[16:15:49] <stashbot>	 T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618
[16:16:30] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:19:18] <wikibugs>	 Puppet, Infrastructure-Foundations: Consider alternative configuration managment tooling - https://phabricator.wikimedia.org/T321874 (Joe) >>! In T321874#8405699, @bking wrote: >  >> I don't think there is a productive and actionable outcome of the discussion in this task, nor that we've made progress in...
[16:19:26] <wikibugs>	 (PS8) Jbond: ms-be2050: enable disks by path configuerations [puppet] - https://gerrit.wikimedia.org/r/848420 (https://phabricator.wikimedia.org/T308677)
[16:19:30] <wikibugs>	 (CR) Btullis: Add a spark-operator chart and helmfile configuraiton (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/855674 (https://phabricator.wikimedia.org/T318926) (owner: Btullis)
[16:19:59] <jinxer-wm>	 (KubernetesAPILatency) firing: (3) High Kubernetes API latency (PATCH events) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[16:21:23] <wikibugs>	 (CR) Jbond: [C: +2] ms-be2050: enable disks by path configuerations [puppet] - https://gerrit.wikimedia.org/r/848420 (https://phabricator.wikimedia.org/T308677) (owner: Jbond)
[16:22:47] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315 (T321126)', diff saved to https://phabricator.wikimedia.org/P40644 and previous config saved to /var/cache/conftool/dbconfig/20221122-162247-marostegui.json
[16:22:49] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2157.codfw.wmnet with reason: Maintenance
[16:22:51] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2157.codfw.wmnet with reason: Maintenance
[16:22:53] <stashbot>	 T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126
[16:22:58] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2157 (T321126)', diff saved to https://phabricator.wikimedia.org/P40645 and previous config saved to /var/cache/conftool/dbconfig/20221122-162257-marostegui.json
[16:24:33] <wikibugs>	 (PS2) Clément Goubert: service::catalog: Add mw-web and mw-api-ext [puppet] - https://gerrit.wikimedia.org/r/859572 (https://phabricator.wikimedia.org/T323621)
[16:24:59] <jinxer-wm>	 (KubernetesAPILatency) firing: (4) High Kubernetes API latency (PATCH events) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[16:25:34] <wikibugs>	 (PS1) Jbond: P:swift::configure_disks: remove ensure_packages [puppet] - https://gerrit.wikimedia.org/r/859573
[16:25:36] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P40646 and previous config saved to /var/cache/conftool/dbconfig/20221122-162536-marostegui.json
[16:25:53] <wikibugs>	 (CR) Clément Goubert: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38387/console" [puppet] - https://gerrit.wikimedia.org/r/859572 (https://phabricator.wikimedia.org/T323621) (owner: Clément Goubert)
[16:26:22] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T321126)', diff saved to https://phabricator.wikimedia.org/P40647 and previous config saved to /var/cache/conftool/dbconfig/20221122-162621-marostegui.json
[16:27:46] <logmsgbot>	 !log eevans@deploy1002 helmfile [codfw] START helmfile.d/services/sessionstore: apply
[16:28:38] <logmsgbot>	 !log eevans@deploy1002 helmfile [codfw] DONE helmfile.d/services/sessionstore: apply
[16:29:15] <wikibugs>	 (PS3) Clément Goubert: service::catalog: Add mw-web and mw-api-ext [puppet] - https://gerrit.wikimedia.org/r/859572 (https://phabricator.wikimedia.org/T323621)
[16:29:50] <wikibugs>	 (CR) Jbond: [C: +2] P:swift::configure_disks: remove ensure_packages [puppet] - https://gerrit.wikimedia.org/r/859573 (owner: Jbond)
[16:32:35] <wikibugs>	 (PS1) Filippo Giunchedi: Add new graphite hosts [deployment-charts] - https://gerrit.wikimedia.org/r/859575 (https://phabricator.wikimedia.org/T315524)
[16:35:26] <wikibugs>	 (PS1) Cathal Mooney: Modify Homer config to ignore port speed warnings [puppet] - https://gerrit.wikimedia.org/r/859576 (https://phabricator.wikimedia.org/T303529)
[16:35:41] <wikibugs>	 (CR) Clément Goubert: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38388/console" [puppet] - https://gerrit.wikimedia.org/r/859572 (https://phabricator.wikimedia.org/T323621) (owner: Clément Goubert)
[16:39:02] <wikibugs>	 (CR) Volans: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/859576 (https://phabricator.wikimedia.org/T303529) (owner: Cathal Mooney)
[16:39:27] <wikibugs>	 (CR) Krinkle: prometheus: move webperf jobs to 'ext' instance (1 comment) [puppet] - https://gerrit.wikimedia.org/r/859566 (https://phabricator.wikimedia.org/T175087) (owner: Filippo Giunchedi)
[16:39:59] <jinxer-wm>	 (KubernetesAPILatency) firing: (4) High Kubernetes API latency (PATCH events) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[16:40:43] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T321130)', diff saved to https://phabricator.wikimedia.org/P40648 and previous config saved to /var/cache/conftool/dbconfig/20221122-164042-marostegui.json
[16:40:44] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1135.eqiad.wmnet with reason: Maintenance
[16:40:49] <stashbot>	 T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130
[16:40:58] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1135.eqiad.wmnet with reason: Maintenance
[16:41:04] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1135 (T321130)', diff saved to https://phabricator.wikimedia.org/P40649 and previous config saved to /var/cache/conftool/dbconfig/20221122-164104-marostegui.json
[16:41:28] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P40650 and previous config saved to /var/cache/conftool/dbconfig/20221122-164128-marostegui.json
[16:42:51] <wikibugs>	 (CR) Cathal Mooney: [C: +2] Modify Homer config to ignore port speed warnings [puppet] - https://gerrit.wikimedia.org/r/859576 (https://phabricator.wikimedia.org/T303529) (owner: Cathal Mooney)
[16:44:15] <wikibugs>	 (PS5) Dzahn: rsync::quickdatacopy: Allow having multiple destination hosts [puppet] - https://gerrit.wikimedia.org/r/715636 (owner: Legoktm)
[16:44:25] <wikibugs>	 (PS13) Btullis: Add a spark-operator chart and helmfile configuraiton [deployment-charts] - https://gerrit.wikimedia.org/r/855674 (https://phabricator.wikimedia.org/T318926)
[16:45:19] <wikibugs>	 (CR) CI reject: [V: -1] Add a spark-operator chart and helmfile configuraiton [deployment-charts] - https://gerrit.wikimedia.org/r/855674 (https://phabricator.wikimedia.org/T318926) (owner: Btullis)
[16:47:38] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.wikireplicas.add-wiki
[16:48:32] <logmsgbot>	 !log eevans@deploy1002 helmfile [eqiad] START helmfile.d/services/sessionstore: apply
[16:49:22] <logmsgbot>	 !log eevans@deploy1002 helmfile [eqiad] DONE helmfile.d/services/sessionstore: apply
[16:51:15] <wikibugs>	 Puppet, Infrastructure-Foundations: Consider alternative configuration management tooling - https://phabricator.wikimedia.org/T321874 (Aklapper)
[16:51:34] <wikibugs>	 (PS1) Filippo Giunchedi: [DNM] remove old graphite hosts [deployment-charts] - https://gerrit.wikimedia.org/r/859579 (https://phabricator.wikimedia.org/T315524)
[16:52:04] <wikibugs>	 (PS6) Cathal Mooney: Add OSPF automation template for EVPN switches [homer/public] - https://gerrit.wikimedia.org/r/857482 (https://phabricator.wikimedia.org/T312635)
[16:52:33] <wikibugs>	 (CR) Cathal Mooney: [C: +2] Add OSPF automation template for EVPN switches [homer/public] - https://gerrit.wikimedia.org/r/857482 (https://phabricator.wikimedia.org/T312635) (owner: Cathal Mooney)
[16:53:10] <wikibugs>	 (Merged) jenkins-bot: Add OSPF automation template for EVPN switches [homer/public] - https://gerrit.wikimedia.org/r/857482 (https://phabricator.wikimedia.org/T312635) (owner: Cathal Mooney)
[16:53:38] <wikibugs>	 (CR) Filippo Giunchedi: [V: +1 C: +2] prometheus: move webperf jobs to 'ext' instance (1 comment) [puppet] - https://gerrit.wikimedia.org/r/859566 (https://phabricator.wikimedia.org/T175087) (owner: Filippo Giunchedi)
[16:53:54] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T321130)', diff saved to https://phabricator.wikimedia.org/P40651 and previous config saved to /var/cache/conftool/dbconfig/20221122-165354-marostegui.json
[16:54:00] <stashbot>	 T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130
[16:55:48] <wikibugs>	 (PS1) Jbond: swift::mount_filesystem: allow overriding the mount point [puppet] - https://gerrit.wikimedia.org/r/859581 (https://phabricator.wikimedia.org/T308677)
[16:56:35] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P40652 and previous config saved to /var/cache/conftool/dbconfig/20221122-165634-marostegui.json
[16:57:34] <wikibugs>	 (03PS14) 10Btullis: Add a spark-operator chart and helmfile configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/855674 (https://phabricator.wikimedia.org/T318926)
[16:58:16] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] swift::mount_filesystem: allow overriding the mount point [puppet] - 10https://gerrit.wikimedia.org/r/859581 (https://phabricator.wikimedia.org/T308677) (owner: 10Jbond)
[16:58:19] <wikibugs>	 (03PS1) 10Raymond Ndibe: cookbooks: print out instructions on next step after updating the buildpack/tekton images in the local repo [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/859582 (https://phabricator.wikimedia.org/T321188)
[16:58:22] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Add a spark-operator chart and helmfile configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/855674 (https://phabricator.wikimedia.org/T318926) (owner: 10Btullis)
[16:58:34] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+2] node: Exclude trafficserver promfile mtime check (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/858658 (https://phabricator.wikimedia.org/T292815) (owner: 10BCornwall)
[16:59:11] <wikibugs>	 (03PS2) 10Stang: Update favicon and CentralAuthLoginIcon for wikifunctionswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/858717 (https://phabricator.wikimedia.org/T323627)
[17:00:04] <jouncebot>	 jbond and rzl: Your horoscope predicts another unfortunate Puppet request window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221122T1700).
[17:00:04] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[17:03:16] <wikibugs>	 (03CR) 10Cathal Mooney: Add function to int_automation to validate QFX5120 port blocks (036 comments) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/812376 (https://phabricator.wikimedia.org/T303529) (owner: 10Cathal Mooney)
[17:03:40] <jinxer-wm>	 (NodeTextfileStale) firing: (48) Stale textfile for cp1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[17:04:40] <jinxer-wm>	 (NodeTextfileStale) firing: (48) Stale textfile for cp1076:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[17:08:40] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[17:09:01] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P40653 and previous config saved to /var/cache/conftool/dbconfig/20221122-170900-marostegui.json
[17:09:30] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of kubestagetcd1006.eqiad.wmnet to drbd
[17:09:40] <jinxer-wm>	 (NodeTextfileStale) firing: (48) Stale textfile for cp1076:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[17:11:41] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T321126)', diff saved to https://phabricator.wikimedia.org/P40654 and previous config saved to /var/cache/conftool/dbconfig/20221122-171141-marostegui.json
[17:11:43] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2171.codfw.wmnet with reason: Maintenance
[17:11:45] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2171.codfw.wmnet with reason: Maintenance
[17:11:47] <stashbot>	 T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126
[17:11:52] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2171:3315 (T321126)', diff saved to https://phabricator.wikimedia.org/P40655 and previous config saved to /var/cache/conftool/dbconfig/20221122-171151-marostegui.json
[17:12:32] <logmsgbot>	 !log btullis@cumin1001 Added views for new wiki: bclwikiquote T316456
[17:12:32] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.wikireplicas.add-wiki (exit_code=0)
[17:12:35] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1122 (re)pooling @ 10%: Maint done', diff saved to https://phabricator.wikimedia.org/P40656 and previous config saved to /var/cache/conftool/dbconfig/20221122-171235-ladsgroup.json
[17:12:37] <stashbot>	 T316456: Prepare and check storage layer for bclwikiquote - https://phabricator.wikimedia.org/T316456
[17:13:24] <wikibugs>	 (03CR) 10Urbanecm: GrowthExperiments: Allow accessing NewImpact module in production (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/859545 (https://phabricator.wikimedia.org/T323526) (owner: 10Kosta Harlan)
[17:13:40] <jinxer-wm>	 (NodeTextfileStale) firing: (48) Stale textfile for cp1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[17:15:19] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315 (T321126)', diff saved to https://phabricator.wikimedia.org/P40657 and previous config saved to /var/cache/conftool/dbconfig/20221122-171519-marostegui.json
[17:15:52] <wikibugs>	 (03PS1) 10Jbond: swift: Allow for mounting using the device directly [puppet] - 10https://gerrit.wikimedia.org/r/859584 (https://phabricator.wikimedia.org/T308677)
[17:16:40] <wikibugs>	 10Puppet, 10Infrastructure-Foundations: Consider alternative configuration management tooling - https://phabricator.wikimedia.org/T321874 (10bking) >>! In T321874#8413774, @Joe wrote: >>>! In T321874#8405699, @bking wrote: >>  >>> I don't think there is a productive and actionable outcome of the discussion in...
[17:17:33] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: apply config changes - bking@cumin2002 - T319020
[17:17:38] <stashbot>	 T319020: Reset to upstream java GC options and remove redundant JVM options - https://phabricator.wikimedia.org/T319020
[17:18:19] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38391/console" [puppet] - 10https://gerrit.wikimedia.org/r/859584 (https://phabricator.wikimedia.org/T308677) (owner: 10Jbond)
[17:18:40] <jinxer-wm>	 (NodeTextfileStale) firing: (48) Stale textfile for cp1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[17:18:55] <jinxer-wm>	 (NodeTextfileStale) firing: (48) Stale textfile for cp1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[17:19:27] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of kubestagetcd1006.eqiad.wmnet to drbd
[17:19:40] <jinxer-wm>	 (NodeTextfileStale) firing: (48) Stale textfile for cp1076:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[17:21:32] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] sites.yaml: add lvs4009 (ulsfo hardware refresh) [homer/public] - 10https://gerrit.wikimedia.org/r/859065 (https://phabricator.wikimedia.org/T317247) (owner: 10Ssingh)
[17:22:22] <icinga-wm>	 PROBLEM - Ganeti memory on ganeti1023 is CRITICAL: CRIT Memory 95% used. Largest process: qemu-system-x86 (2179388) = 23.0% https://wikitech.wikimedia.org/wiki/Ganeti%23Memory_pressure
[17:23:40] <jinxer-wm>	 (NodeTextfileStale) firing: (40) Stale textfile for cp1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[17:24:07] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P40658 and previous config saved to /var/cache/conftool/dbconfig/20221122-172407-marostegui.json
[17:25:17] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of kubestagetcd1006.eqiad.wmnet to plain
[17:25:57] <wikibugs>	 10SRE, 10observability, 10Epic, 10Release-Engineering-Team (Radar), 10Sustainability (Incident Followup): Tracking: Monitoring and alerts for "business" metrics - https://phabricator.wikimedia.org/T140942 (10Aklapper)
[17:26:46] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of kubestagetcd1006.eqiad.wmnet to plain
[17:27:40] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1122 (re)pooling @ 25%: Maint done', diff saved to https://phabricator.wikimedia.org/P40659 and previous config saved to /var/cache/conftool/dbconfig/20221122-172740-ladsgroup.json
[17:28:25] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: felix: Instruct felix to set the src parameter [deployment-charts] - 10https://gerrit.wikimedia.org/r/859586
[17:28:40] <jinxer-wm>	 (NodeTextfileStale) resolved: (32) Stale textfile for cp1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[17:28:46] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38392/console" [puppet] - 10https://gerrit.wikimedia.org/r/859584 (https://phabricator.wikimedia.org/T308677) (owner: 10Jbond)
[17:29:40] <jinxer-wm>	 (NodeTextfileStale) resolved: (32) Stale textfile for cp1076:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[17:29:40] <wikibugs>	 (03PS15) 10Btullis: Add a spark-operator chart and helmfile configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/855674 (https://phabricator.wikimedia.org/T318926)
[17:29:57] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38393/console" [puppet] - 10https://gerrit.wikimedia.org/r/859584 (https://phabricator.wikimedia.org/T308677) (owner: 10Jbond)
[17:30:26] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315', diff saved to https://phabricator.wikimedia.org/P40660 and previous config saved to /var/cache/conftool/dbconfig/20221122-173025-marostegui.json
[17:30:43] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.wikireplicas.add-wiki
[17:31:05] <wikibugs>	 (03PS2) 10Jbond: swift: Allow for mounting using the device directly [puppet] - 10https://gerrit.wikimedia.org/r/859584 (https://phabricator.wikimedia.org/T308677)
[17:31:18] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: felix: Instruct felix to set the src parameter [deployment-charts] - 10https://gerrit.wikimedia.org/r/859586
[17:32:06] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38394/console" [puppet] - 10https://gerrit.wikimedia.org/r/859584 (https://phabricator.wikimedia.org/T308677) (owner: 10Jbond)
[17:33:40] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] swift: Allow for mounting using the device directly [puppet] - 10https://gerrit.wikimedia.org/r/859584 (https://phabricator.wikimedia.org/T308677) (owner: 10Jbond)
[17:34:59] <jinxer-wm>	 (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[17:38:11] <logmsgbot>	 !log brett@cumin1001 START - Cookbook sre.hosts.reimage for host cp2041.codfw.wmnet with OS bullseye
[17:38:19] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin1001 for host cp2041.codfw.wmnet with OS bullseye
[17:39:14] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T321130)', diff saved to https://phabricator.wikimedia.org/P40661 and previous config saved to /var/cache/conftool/dbconfig/20221122-173913-marostegui.json
[17:39:15] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1139.eqiad.wmnet with reason: Maintenance
[17:39:19] <stashbot>	 T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130
[17:39:29] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1139.eqiad.wmnet with reason: Maintenance
[17:42:45] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1122 (re)pooling @ 75%: Maint done', diff saved to https://phabricator.wikimedia.org/P40662 and previous config saved to /var/cache/conftool/dbconfig/20221122-174245-ladsgroup.json
[17:45:32] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315', diff saved to https://phabricator.wikimedia.org/P40663 and previous config saved to /var/cache/conftool/dbconfig/20221122-174532-marostegui.json
[17:45:53] <logmsgbot>	 !log btullis@cumin2002 START - Cookbook sre.presto.roll-restart-workers for Presto analytics cluster: Roll restart of all Presto's jvm daemons.
[17:47:10] <icinga-wm>	 PROBLEM - Puppet CA expired certs on puppetmaster1001 is CRITICAL: CRITICAL: 6 puppet certs need to be renewed: https://wikitech.wikimedia.org/wiki/Puppet%23Renew_agent_certificate
[17:48:17] <wikibugs>	 (03PS1) 10Ladsgroup: mediawiki: Reduce the frequency of flaggedrevs updates [puppet] - 10https://gerrit.wikimedia.org/r/859589 (https://phabricator.wikimedia.org/T323495)
[17:48:20] <icinga-wm>	 RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:50:06] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1140.eqiad.wmnet with reason: Maintenance
[17:50:20] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1140.eqiad.wmnet with reason: Maintenance
[17:51:56] <icinga-wm>	 PROBLEM - Ganeti memory on ganeti1023 is CRITICAL: CRIT Memory 95% used. Largest process: qemu-system-x86 (2179388) = 23.0% https://wikitech.wikimedia.org/wiki/Ganeti%23Memory_pressure
[17:54:12] <icinga-wm>	 PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:54:36] <icinga-wm>	 RECOVERY - Check systemd state on logstash1026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:54:59] <jinxer-wm>	 (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[17:55:48] <logmsgbot>	 !log btullis@cumin1001 Added views for new wiki: igwikiquote T314639
[17:55:48] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.wikireplicas.add-wiki (exit_code=0)
[17:55:54] <stashbot>	 T314639: Prepare and check storage layer for igwikiquote - https://phabricator.wikimedia.org/T314639
[17:56:39] <logmsgbot>	 !log btullis@cumin2002 END (PASS) - Cookbook sre.presto.roll-restart-workers (exit_code=0) for Presto analytics cluster: Roll restart of all Presto's jvm daemons.
[17:57:50] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1122 (re)pooling @ 100%: Maint done', diff saved to https://phabricator.wikimedia.org/P40664 and previous config saved to /var/cache/conftool/dbconfig/20221122-175750-ladsgroup.json
[17:59:41] <wikibugs>	 (03PS1) 10Jbond: swift: move ms-be2050 to new naming schema [puppet] - 10https://gerrit.wikimedia.org/r/859592 (https://phabricator.wikimedia.org/T308677)
[18:00:39] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315 (T321126)', diff saved to https://phabricator.wikimedia.org/P40665 and previous config saved to /var/cache/conftool/dbconfig/20221122-180038-marostegui.json
[18:00:40] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2178.codfw.wmnet with reason: Maintenance
[18:00:43] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2178.codfw.wmnet with reason: Maintenance
[18:00:45] <stashbot>	 T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126
[18:00:49] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2178 (T321126)', diff saved to https://phabricator.wikimedia.org/P40666 and previous config saved to /var/cache/conftool/dbconfig/20221122-180049-marostegui.json
[18:00:50] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1169.eqiad.wmnet with reason: Maintenance
[18:00:54] <wikibugs>	 (03CR) 10Jbond: [C: 04-1] "self -1 as not sure of the consequences" [puppet] - 10https://gerrit.wikimedia.org/r/859592 (https://phabricator.wikimedia.org/T308677) (owner: 10Jbond)
[18:01:03] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1169.eqiad.wmnet with reason: Maintenance
[18:01:10] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1169 (T321130)', diff saved to https://phabricator.wikimedia.org/P40667 and previous config saved to /var/cache/conftool/dbconfig/20221122-180109-marostegui.json
[18:01:15] <stashbot>	 T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130
[18:04:13] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T321126)', diff saved to https://phabricator.wikimedia.org/P40668 and previous config saved to /var/cache/conftool/dbconfig/20221122-180412-marostegui.json
[18:07:46] <icinga-wm>	 PROBLEM - SSH on db1120.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:09:59] <jinxer-wm>	 (KubernetesAPILatency) firing: (3) High Kubernetes API latency (PATCH events) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[18:11:10] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:12:21] <wikibugs>	 (03PS3) 10AOkoth: vrts: add error checking [puppet] - 10https://gerrit.wikimedia.org/r/858716 (https://phabricator.wikimedia.org/T317059)
[18:13:19] <wikibugs>	 (03CR) 10AOkoth: vrts: add error checking (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/858716 (https://phabricator.wikimedia.org/T317059) (owner: 10AOkoth)
[18:13:51] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T321130)', diff saved to https://phabricator.wikimedia.org/P40669 and previous config saved to /var/cache/conftool/dbconfig/20221122-181351-marostegui.json
[18:13:58] <stashbot>	 T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130
[18:14:59] <jinxer-wm>	 (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[18:18:04] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:19:02] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] vrts: add error checking [puppet] - 10https://gerrit.wikimedia.org/r/858716 (https://phabricator.wikimedia.org/T317059) (owner: 10AOkoth)
[18:19:22] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P40670 and previous config saved to /var/cache/conftool/dbconfig/20221122-181919-marostegui.json
[18:28:49] <wikibugs>	 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T319126 (10phaultfinder)
[18:28:58] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P40671 and previous config saved to /var/cache/conftool/dbconfig/20221122-182857-marostegui.json
[18:30:00] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:32:22] <icinga-wm>	 PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:32:48] <moritzm>	 !log installing pcre2 security updates
[18:32:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:34:03] <logmsgbot>	 !log brett@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp2041.codfw.wmnet with OS bullseye
[18:34:10] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin1001 for host cp2041.codfw.wmnet with OS bullseye executed with errors: - cp2041 (**FAIL**)   - Downtimed on Ic...
[18:34:29] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P40672 and previous config saved to /var/cache/conftool/dbconfig/20221122-183428-marostegui.json
[18:38:30] <wikibugs>	 (03PS1) 10Muehlenhoff: Add library hint for pcre2 [puppet] - 10https://gerrit.wikimedia.org/r/859594
[18:39:56] <icinga-wm>	 RECOVERY - Check systemd state on dbstore1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:44:04] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P40673 and previous config saved to /var/cache/conftool/dbconfig/20221122-184404-marostegui.json
[18:44:59] <jinxer-wm>	 (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[18:46:29] <sukhe>	 !log cr[34]-ulsfo: set routing-options static route 198.35.26.112/28 next-hop 10.128.0.9: T317247
[18:46:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:46:35] <stashbot>	 T317247: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247
[18:47:53] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Add library hint for pcre2 [puppet] - 10https://gerrit.wikimedia.org/r/859594 (owner: 10Muehlenhoff)
[18:48:27] <sukhe>	 !log decommissioning lvs4006: T317247
[18:48:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:48:38] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] lvs4006: set profile::pybal::bgp to no [puppet] - 10https://gerrit.wikimedia.org/r/859086 (https://phabricator.wikimedia.org/T317247) (owner: 10Ssingh)
[18:48:57] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 4:00:00 on lvs4006.ulsfo.wmnet with reason: downtimed, in the process of decom
[18:49:13] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on lvs4006.ulsfo.wmnet with reason: downtimed, in the process of decom
[18:49:35] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T321126)', diff saved to https://phabricator.wikimedia.org/P40674 and previous config saved to /var/cache/conftool/dbconfig/20221122-184934-marostegui.json
[18:49:40] <stashbot>	 T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126
[18:51:28] <icinga-wm>	 PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[18:52:12] <icinga-wm>	 PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[18:52:18] <sukhe>	 ^ expected
[18:52:47] <wikibugs>	 (03PS2) 10Muehlenhoff: webperf: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/858605 (https://phabricator.wikimedia.org/T308013)
[18:56:35] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] webperf: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/858605 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff)
[18:58:30] <wikibugs>	 (03PS1) 10Ssingh: lvs4009: set as high-traffic2 primary LVS and remove lvs4006 (decomm) [puppet] - 10https://gerrit.wikimedia.org/r/859598 (https://phabricator.wikimedia.org/T317247)
[18:59:11] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T321130)', diff saved to https://phabricator.wikimedia.org/P40675 and previous config saved to /var/cache/conftool/dbconfig/20221122-185910-marostegui.json
[18:59:12] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1184.eqiad.wmnet with reason: Maintenance
[18:59:17] <stashbot>	 T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130
[18:59:36] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] webperf: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/858605 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff)
[18:59:37] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1184.eqiad.wmnet with reason: Maintenance
[18:59:43] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1184 (T321130)', diff saved to https://phabricator.wikimedia.org/P40676 and previous config saved to /var/cache/conftool/dbconfig/20221122-185943-marostegui.json
[18:59:59] <jinxer-wm>	 (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[19:00:33] <wikibugs>	 (03PS1) 10Ssingh: sites.yaml: remove decommissioned host lvs4006 [homer/public] - 10https://gerrit.wikimedia.org/r/859600 (https://phabricator.wikimedia.org/T317247)
[19:01:49] <wikibugs>	 10SRE, 10ops-eqiad, 10Traffic: Host lvs1014.mgmt is down - https://phabricator.wikimedia.org/T322933 (10wiki_willy) a:03Jclark-ctr
[19:02:22] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Decommission mw13[07-48] - https://phabricator.wikimedia.org/T306162 (10wiki_willy) a:03Jclark-ctr
[19:04:02] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): Q1:rerack elastic10[53-67] - https://phabricator.wikimedia.org/T322082 (10wiki_willy) a:05Cmjohnson→03Jclark-ctr
[19:07:40] <icinga-wm>	 PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[19:08:42] <icinga-wm>	 RECOVERY - SSH on db1120.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:09:36] <icinga-wm>	 RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST
[19:13:17] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.decommission for hosts lvs4006.ulsfo.wmnet
[19:13:38] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T321130)', diff saved to https://phabricator.wikimedia.org/P40677 and previous config saved to /var/cache/conftool/dbconfig/20221122-191337-marostegui.json
[19:13:43] <stashbot>	 T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130
[19:17:20] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.dns.netbox
[19:18:53] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp2041.codfw.wmnet with OS bullseye
[19:19:00] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp2041.codfw.wmnet with OS bullseye
[19:19:42] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[19:19:43] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts lvs4006.ulsfo.wmnet
[19:19:51] <wikibugs>	 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by sukhe@cumin2002 for hosts: `lvs4006.ulsfo.wmnet` - lvs4006.ulsfo.wmnet (**WARN**)   - D...
[19:21:19] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] lvs4009: set as high-traffic2 primary LVS and remove lvs4006 (decomm) [puppet] - 10https://gerrit.wikimedia.org/r/859598 (https://phabricator.wikimedia.org/T317247) (owner: 10Ssingh)
[19:21:42] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:22:24] <icinga-wm>	 PROBLEM - Ganeti memory on ganeti1023 is CRITICAL: CRIT Memory 95% used. Largest process: qemu-system-x86 (2179388) = 23.0% https://wikitech.wikimedia.org/wiki/Ganeti%23Memory_pressure
[19:22:58] <wikibugs>	 (03CR) 10Ssingh: [V: 03+2 C: 03+2] "Forgive me for merging this without review but it's a removal of a host that was decommissioned and it will alert otherwise!" [homer/public] - 10https://gerrit.wikimedia.org/r/859600 (https://phabricator.wikimedia.org/T317247) (owner: 10Ssingh)
[19:24:08] <wikibugs>	 (03Merged) 10jenkins-bot: sites.yaml: remove decommissioned host lvs4006 [homer/public] - 10https://gerrit.wikimedia.org/r/859600 (https://phabricator.wikimedia.org/T317247) (owner: 10Ssingh)
[19:24:16] <wikibugs>	 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T319126 (10phaultfinder)
[19:24:51] <sukhe>	 !log running homer for Gerrit 859600: lvs4006 decommission
[19:24:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:28:04] <logmsgbot>	 !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp2041.codfw.wmnet with OS bullseye
[19:28:13] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp2041.codfw.wmnet with OS bullseye executed with errors: - cp2041 (**FAIL**)   - Removed from Pu...
[19:28:44] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P40678 and previous config saved to /var/cache/conftool/dbconfig/20221122-192844-marostegui.json
[19:32:42] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:42:33] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2041.codfw.wmnet']
[19:42:47] <logmsgbot>	 !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp2041.codfw.wmnet']
[19:43:51] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P40679 and previous config saved to /var/cache/conftool/dbconfig/20221122-194350-marostegui.json
[19:44:59] <jinxer-wm>	 (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[19:46:21] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2041.codfw.wmnet']
[19:46:28] <logmsgbot>	 !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp2041.codfw.wmnet']
[19:47:26] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2041.codfw.wmnet']
[19:47:30] <logmsgbot>	 !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp2041.codfw.wmnet']
[19:47:36] <sukhe>	 hmmm
[19:49:24] <wikibugs>	 (03PS1) 10Jbond: install_server: Add dynamic raid configuration [puppet] - 10https://gerrit.wikimedia.org/r/859607 (https://phabricator.wikimedia.org/T308677)
[19:49:55] <wikibugs>	 (03CR) 10AOkoth: [C: 03+2] vrts: add error checking [puppet] - 10https://gerrit.wikimedia.org/r/858716 (https://phabricator.wikimedia.org/T317059) (owner: 10AOkoth)
[19:50:06] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] install_server: Add dynamic raid configuration [puppet] - 10https://gerrit.wikimedia.org/r/859607 (https://phabricator.wikimedia.org/T308677) (owner: 10Jbond)
[19:50:31] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2041.codfw.wmnet']
[19:50:38] <logmsgbot>	 !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp2041.codfw.wmnet']
[19:50:42] <wikibugs>	 (03PS2) 10Jbond: install_server: Add dynamic raid configuration [puppet] - 10https://gerrit.wikimedia.org/r/859607 (https://phabricator.wikimedia.org/T308677)
[19:50:57] <wikibugs>	 (03PS3) 10Jbond: install_server: Add dynamic raid configuration [puppet] - 10https://gerrit.wikimedia.org/r/859607 (https://phabricator.wikimedia.org/T308677)
[19:51:39] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] install_server: Add dynamic raid configuration [puppet] - 10https://gerrit.wikimedia.org/r/859607 (https://phabricator.wikimedia.org/T308677) (owner: 10Jbond)
[19:51:40] <sukhe>	 HTTPS interface it is then for now :)
[19:51:46] <wikibugs>	 10SRE, 10ops-eqiad, 10Cassandra, 10DC-Ops, 10User-Eevans: Relocate hosts: aqs10[3-5] - https://phabricator.wikimedia.org/T307035 (10wiki_willy) a:05Cmjohnson→03Jclark-ctr
[19:51:52] <icinga-wm>	 PROBLEM - Ganeti memory on ganeti1023 is CRITICAL: CRIT Memory 95% used. Largest process: qemu-system-x86 (2179388) = 23.0% https://wikitech.wikimedia.org/wiki/Ganeti%23Memory_pressure
[19:53:51] <logmsgbot>	 !log brett@cumin1001 START - Cookbook sre.hosts.reimage for host cp2041.codfw.wmnet with OS bullseye
[19:54:03] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin1001 for host cp2041.codfw.wmnet with OS bullseye
[19:54:20] <wikibugs>	 (03CR) 10Jbond: install_server: Add dynamic raid configuration (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/859607 (https://phabricator.wikimedia.org/T308677) (owner: 10Jbond)
[19:55:16] <wikibugs>	 (03PS4) 10Jbond: install_server: Add dynamic raid configuration [puppet] - 10https://gerrit.wikimedia.org/r/859607 (https://phabricator.wikimedia.org/T308677)
[19:58:57] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T321130)', diff saved to https://phabricator.wikimedia.org/P40680 and previous config saved to /var/cache/conftool/dbconfig/20221122-195857-marostegui.json
[19:58:59] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1186.eqiad.wmnet with reason: Maintenance
[19:59:03] <stashbot>	 T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130
[19:59:18] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2041']
[19:59:23] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1186.eqiad.wmnet with reason: Maintenance
[19:59:24] <logmsgbot>	 !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp2041']
[19:59:30] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1186 (T321130)', diff saved to https://phabricator.wikimedia.org/P40681 and previous config saved to /var/cache/conftool/dbconfig/20221122-195929-marostegui.json
[19:59:54] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Shared-Data-Infrastructure: Attempt to move some GPUs from Hadoop to the DSE-K8S cluster - https://phabricator.wikimedia.org/T318696 (10wiki_willy) a:03Jclark-ctr
[20:02:09] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: Q2:rack/setup/install puppetdb1003 - https://phabricator.wikimedia.org/T317892 (10wiki_willy) Hi @MoritzMuehlenhoff - thanks for the heads up on IRC.  @Papaul will be taking a look at the host, to wrap up the installation by the end of the week.  Thanks, Willy
[20:03:26] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2041']
[20:03:31] <logmsgbot>	 !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp2041']
[20:03:50] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2041.codfw.wmnet']
[20:03:53] <logmsgbot>	 !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp2041.codfw.wmnet']
[20:04:20] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2041.codfw.wmnet']
[20:04:24] <logmsgbot>	 !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp2041.codfw.wmnet']
[20:04:44] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2041']
[20:04:47] <logmsgbot>	 !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp2041']
[20:04:50] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2041']
[20:04:55] <logmsgbot>	 !log brett@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp2041.codfw.wmnet with OS bullseye
[20:05:00] <logmsgbot>	 !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp2041']
[20:05:05] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin1001 for host cp2041.codfw.wmnet with OS bullseye executed with errors: - cp2041 (**FAIL**)   - Removed from Pu...
[20:05:34] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp2041']
[20:05:41] <logmsgbot>	 !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp2041']
[20:07:22] <sukhe>	 !log sudo ipmitool -I lanplus -H "cp2041.mgmt.codfw.wmnet" -U root -E chassis power cycle
[20:07:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:11:40] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T321130)', diff saved to https://phabricator.wikimedia.org/P40682 and previous config saved to /var/cache/conftool/dbconfig/20221122-201140-marostegui.json
[20:11:46] <stashbot>	 T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130
[20:16:26] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:19:31] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[20:19:50] <wikibugs>	 (03PS1) 10Stevemunene: Allow introspection for production environment [puppet] - 10https://gerrit.wikimedia.org/r/859610 (https://phabricator.wikimedia.org/T308778)
[20:19:59] <jinxer-wm>	 (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[20:20:28] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Allow introspection for production environment [puppet] - 10https://gerrit.wikimedia.org/r/859610 (https://phabricator.wikimedia.org/T308778) (owner: 10Stevemunene)
[20:21:58] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[20:23:41] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host puppetdb1003.mgmt.eqiad.wmnet with reboot policy FORCED
[20:25:49] <wikibugs>	 (03PS2) 10Stevemunene: Allow introspection for production environment [puppet] - 10https://gerrit.wikimedia.org/r/859610 (https://phabricator.wikimedia.org/T308778)
[20:26:47] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1186', diff saved to https://phabricator.wikimedia.org/P40683 and previous config saved to /var/cache/conftool/dbconfig/20221122-202646-marostegui.json
[20:32:48] <bwang>	 hello! just to note i have 2 patches for the deployment window in 30 min, but i have to step away for the next hour, so i will be back 30 min after the deployment window starts
[20:33:23] <bwang>	 sorry, i hope it's not too inconvenient to the deployer!
[20:36:46] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host puppetdb1003.mgmt.eqiad.wmnet with reboot policy FORCED
[20:41:53] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1186', diff saved to https://phabricator.wikimedia.org/P40684 and previous config saved to /var/cache/conftool/dbconfig/20221122-204153-marostegui.json
[20:48:16] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['puppetdb1003']
[20:52:30] <icinga-wm>	 PROBLEM - Ganeti memory on ganeti1023 is CRITICAL: CRIT Memory 95% used. Largest process: qemu-system-x86 (2179388) = 23.0% https://wikitech.wikimedia.org/wiki/Ganeti%23Memory_pressure
[20:54:53] <wikibugs>	 (03PS3) 10Samtar: Update favicon and CentralAuthLoginIcon for wikifunctionswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/858717 (https://phabricator.wikimedia.org/T323627) (owner: 10Stang)
[20:55:02] <wikibugs>	 (03PS3) 10Samtar: zhwiki: Install PageTriage on Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/858705 (https://phabricator.wikimedia.org/T323378) (owner: 10Stang)
[20:57:00] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T321130)', diff saved to https://phabricator.wikimedia.org/P40685 and previous config saved to /var/cache/conftool/dbconfig/20221122-205659-marostegui.json
[20:57:01] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1196.eqiad.wmnet with reason: Maintenance
[20:57:06] <stashbot>	 T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130
[20:57:14] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1196.eqiad.wmnet with reason: Maintenance
[20:57:21] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1196 (T321130)', diff saved to https://phabricator.wikimedia.org/P40686 and previous config saved to /var/cache/conftool/dbconfig/20221122-205720-marostegui.json
[20:57:42] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['puppetdb1003']
[20:58:07] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: Q2:rack/setup/install puppetdb1003 - https://phabricator.wikimedia.org/T317892 (10Papaul)
[20:58:22] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['puppetdb1003']
[20:58:55] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] "Great! Thanks Steve." [puppet] - 10https://gerrit.wikimedia.org/r/859610 (https://phabricator.wikimedia.org/T308778) (owner: 10Stevemunene)
[20:59:05] <wikibugs>	 10SRE, 10Fundraising-Backlog, 10Traffic-Icebox, 10fr-donorservices, and 4 others: SSL cert for links.email.wikimedia.org - https://phabricator.wikimedia.org/T188561 (10XenoRyet)
[21:00:04] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: How many deployers does it take to do UTC late backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221122T2100).
[21:00:04] <jouncebot>	 bwang and cirno: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[21:00:14] <TheresNoTime>	 I can deploy :)
[21:00:31] <cirno>	 o/
[21:00:58] <TheresNoTime>	 cirno: I'm going to start with your 858717, then run the maintenance script for 858705
[21:01:11] <cirno>	 well please wait
[21:01:20] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/858717 (https://phabricator.wikimedia.org/T323627) (owner: 10Stang)
[21:01:33] <TheresNoTime>	 cirno: stop? ^
[21:01:44] <cirno>	 I removed the logo one, as maybe putting it off for a while is better
[21:01:50] <logmsgbot>	 !log samtar@deploy1002 backport aborted:  (duration: 00m 33s)
[21:02:02] <wikibugs>	 (03Merged) 10jenkins-bot: Update favicon and CentralAuthLoginIcon for wikifunctionswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/858717 (https://phabricator.wikimedia.org/T323627) (owner: 10Stang)
[21:02:35] <cirno>	 just refresh the latest calendar :)
[21:02:51] <cirno>	 so maybe revert this one? not sure what to do
[21:02:56] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host dbprov1004.mgmt.eqiad.wmnet with reboot policy FORCED
[21:02:57] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dbprov1004.mgmt.eqiad.wmnet with reboot policy FORCED
[21:03:10] <TheresNoTime>	 ack
[21:03:30] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host dbprov1004.mgmt.eqiad.wmnet with reboot policy FORCED
[21:03:32] <wikibugs>	 (03PS1) 10Samtar: Revert "Update favicon and CentralAuthLoginIcon for wikifunctionswiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/859509
[21:03:57] <wikibugs>	 (03CR) 10Samtar: [C: 03+2] "Reverting" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/859509 (owner: 10Samtar)
[21:04:10] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['puppetdb1003']
[21:04:44] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "Update favicon and CentralAuthLoginIcon for wikifunctionswiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/859509 (owner: 10Samtar)
[21:04:57] <TheresNoTime>	 cirno: okay, reverted :) 
[21:05:26] <TheresNoTime>	 I'll try the maintenance script for 858705 now
[21:05:38] <cirno>	 thanks, left a message on the relevant task  :)
[21:06:50] <TheresNoTime>	 cirno: that seems to have worked as expected :)
[21:07:05] <cirno>	 TheresNoTime: does beta cluster support wikimediadebug or not? So I can access mwdebug1001 during the deploy
[21:07:17] <TheresNoTime>	 cirno: it does not afaik
[21:07:24] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/858705 (https://phabricator.wikimedia.org/T323378) (owner: 10Stang)
[21:08:06] <wikibugs>	 (03Merged) 10jenkins-bot: zhwiki: Install PageTriage on Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/858705 (https://phabricator.wikimedia.org/T323378) (owner: 10Stang)
[21:08:40] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[21:10:50] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1196 (T321130)', diff saved to https://phabricator.wikimedia.org/P40687 and previous config saved to /var/cache/conftool/dbconfig/20221122-211049-marostegui.json
[21:10:55] <stashbot>	 T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130
[21:10:56] <wikibugs>	 (03PS1) 10Stang: Revert "Revert "Update favicon and CentralAuthLoginIcon for wikifunctionswiki"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/859510 (https://phabricator.wikimedia.org/T323627)
[21:11:24] <wikibugs>	 (03PS2) 10Stang: Revert "Revert "Update favicon and CentralAuthLoginIcon for wikifunctionswiki"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/859510 (https://phabricator.wikimedia.org/T323627)
[21:11:47] <TheresNoTime>	 cirno: just waiting for `beta-code-update-eqiad` to finish, then that'll hopefully be live on beta
[21:11:54] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:12:30] <TheresNoTime>	 s/`beta-code-update-eqiad`/`beta-scap-sync-world`
[21:12:58] <cjming>	 TheresNoTime: hi! thanks for deploying -- bwang mentioned to me earlier today that he'll be available about halfway thru this backport window - he should be around in 15
[21:13:42] <TheresNoTime>	 cjming: no worries, they also left a message on the calendar which I saw :)
[21:13:55] <TheresNoTime>	 cirno: that's live on beta, and looking at https://zh.wikipedia.beta.wmflabs.org/wiki/Special:%E7%89%88%E6%9C%AC it seems to be enabled at least?
[21:14:21] <cirno>	 looking
[21:16:18] <TheresNoTime>	 https://zh.wikipedia.beta.wmflabs.org/wiki/Special:%E6%96%B0%E9%A1%B5%E9%9D%A2%E4%BE%9B%E7%BB%99 loads, not sure if no results is expected to be honest..
[21:16:23] <cirno>	 yeah I could see PageTriage appeared in Special:version, but there's something weird, like https://zh.wikipedia.beta.wmflabs.org/wiki/Special:%E6%96%B0%E9%A1%B5%E9%9D%A2%E4%BE%9B%E7%BB%99 contains nothing... is that expected?
[21:17:14] <cirno>	 I'm creating a new article with alt account to test
[21:20:22] <cirno>	 I created a new page called https://zh.wikipedia.beta.wmflabs.org/wiki/12345, and the pagetriage tool on the right hand side appears, so LGTM!
[21:20:52] <TheresNoTime>	 ack, noting T323647
[21:20:52] <stashbot>	 T323647: PHP Notice: Undefined index: afc_state - https://phabricator.wikimedia.org/T323647
[21:22:18] <icinga-wm>	 PROBLEM - Ganeti memory on ganeti1023 is CRITICAL: CRIT Memory 95% used. Largest process: qemu-system-x86 (2179388) = 23.0% https://wikitech.wikimedia.org/wiki/Ganeti%23Memory_pressure
[21:23:11] <TheresNoTime>	 guess that's somewhat expected
[21:25:54] <TheresNoTime>	 bwang: lemme know when you're about for your patches :)
[21:25:57] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1196', diff saved to https://phabricator.wikimedia.org/P40688 and previous config saved to /var/cache/conftool/dbconfig/20221122-212556-marostegui.json
[21:29:23] <wikibugs>	 10SRE, 10observability, 10serviceops, 10Patch-For-Review, and 2 others: Create an alert for high memcached bw usage - https://phabricator.wikimedia.org/T224454 (10jijiki) >>! In T224454#8411988, @elukey wrote: > An optional (but in my opinion useful) alert could be related to a prolonged usage of the gutte...
[21:30:00] <wikibugs>	 (03CR) 10Samtar: [C: 03+2] "starting deploy" [skins/Vector] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/859508 (https://phabricator.wikimedia.org/T323176) (owner: 10Bernard Wang)
[21:31:47] <wikibugs>	 (03CR) 10Samtar: [C: 03+2] "starting deploy" [skins/Vector] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/859076 (https://phabricator.wikimedia.org/T317897) (owner: 10Jdlrobson)
[21:32:31] <bwang>	 TheresNoTime: hi i'm back and ready!
[21:32:42] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host dbprov1004.mgmt.eqiad.wmnet with reboot policy FORCED
[21:33:01] <TheresNoTime>	 bwang: hi! I've just started off the patches merging, seeing as they take ~10 minutes
[21:33:25] <bwang>	 gotcha, just lmk!
[21:33:25] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['dbprov1004']
[21:35:53] <wikibugs>	 10SRE, 10serviceops, 10good first task: Upgrade all deployment charts to use the latest version of common_templates - https://phabricator.wikimedia.org/T292390 (10Aklapper)
[21:41:04] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1196', diff saved to https://phabricator.wikimedia.org/P40689 and previous config saved to /var/cache/conftool/dbconfig/20221122-214103-marostegui.json
[21:43:53] <wikibugs>	 (03Merged) 10jenkins-bot: Fix icon button spacing in sticky header [skins/Vector] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/859508 (https://phabricator.wikimedia.org/T323176) (owner: 10Bernard Wang)
[21:44:02] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [skins/Vector] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/859508 (https://phabricator.wikimedia.org/T323176) (owner: 10Bernard Wang)
[21:44:17] <logmsgbot>	 !log samtar@deploy1002 Started scap: Backport for [[gerrit:859508|Fix icon button spacing in sticky header (T323176)]]
[21:44:18] <TheresNoTime>	 bwang: starting with 859508 :)
[21:44:22] <stashbot>	 T323176: [S] Sticky header icon buttons are missing padding - https://phabricator.wikimedia.org/T323176
[21:44:39] <logmsgbot>	 !log samtar@deploy1002 samtar and bwang: Backport for [[gerrit:859508|Fix icon button spacing in sticky header (T323176)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet
[21:44:51] <TheresNoTime>	 bwang: that's live on mwdebug now, can you test?
[21:45:14] <bwang>	 yep, but which one?
[21:45:29] <wikibugs>	 (03Merged) 10jenkins-bot: Update TOC to use PinnableHeader [skins/Vector] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/859076 (https://phabricator.wikimedia.org/T317897) (owner: 10Jdlrobson)
[21:45:31] <TheresNoTime>	 use mwdebug1001 :)
[21:45:31] <bwang>	 or does it not matter which number
[21:45:38] <TheresNoTime>	 (doesn't matter afaik)
[21:47:25] <bwang>	 great the first patch looks good
[21:47:33] <TheresNoTime>	 syncing that patch now
[21:48:59] <bwang>	 is the second one also ready to test?
[21:49:34] <TheresNoTime>	 bwang: not yet, be about ~5 minutes :)
[21:49:46] <bwang>	 👍
[21:49:59] <jinxer-wm>	 (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[21:51:42] <logmsgbot>	 !log samtar@deploy1002 Finished scap: Backport for [[gerrit:859508|Fix icon button spacing in sticky header (T323176)]] (duration: 07m 25s)
[21:51:47] <stashbot>	 T323176: [S] Sticky header icon buttons are missing padding - https://phabricator.wikimedia.org/T323176
[21:51:49] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [skins/Vector] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/859076 (https://phabricator.wikimedia.org/T317897) (owner: 10Jdlrobson)
[21:52:03] <logmsgbot>	 !log samtar@deploy1002 Started scap: Backport for [[gerrit:859076|Update TOC to use PinnableHeader (T317897)]]
[21:52:08] <stashbot>	 T317897: [L] [Page Tools] Make the page tools menu pinnable - https://phabricator.wikimedia.org/T317897
[21:52:12] <icinga-wm>	 PROBLEM - Ganeti memory on ganeti1023 is CRITICAL: CRIT Memory 95% used. Largest process: qemu-system-x86 (2179388) = 23.0% https://wikitech.wikimedia.org/wiki/Ganeti%23Memory_pressure
[21:52:24] <logmsgbot>	 !log samtar@deploy1002 samtar and jdlrobson: Backport for [[gerrit:859076|Update TOC to use PinnableHeader (T317897)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet
[21:52:27] <TheresNoTime>	 bwang: 859508 should be live everywhere now, and 859076 is available to test on mwdebug1001
[21:54:00] <bwang>	 859076 looks good too! 
[21:54:07] <TheresNoTime>	 syncin'!
[21:56:10] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1196 (T321130)', diff saved to https://phabricator.wikimedia.org/P40690 and previous config saved to /var/cache/conftool/dbconfig/20221122-215610-marostegui.json
[21:56:12] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[21:56:14] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[21:56:15] <bwang>	 thanks!!
[21:56:16] <stashbot>	 T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130
[21:58:15] <logmsgbot>	 !log samtar@deploy1002 Finished scap: Backport for [[gerrit:859076|Update TOC to use PinnableHeader (T317897)]] (duration: 06m 11s)
[21:58:18] <TheresNoTime>	 np! both patches should be live everywhere now :)
[21:58:20] <stashbot>	 T317897: [L] [Page Tools] Make the page tools menu pinnable - https://phabricator.wikimedia.org/T317897
[21:58:48] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['dbprov1004']
[21:59:56] <TheresNoTime>	 !log close UTC late backport window
[22:00:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:06:43] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2097.codfw.wmnet with reason: Maintenance
[22:06:56] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2097.codfw.wmnet with reason: Maintenance
[22:14:59] <jinxer-wm>	 (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[22:15:23] <mutante>	 !log phab1004 - rsyncing /srv/repos from phab1001 with 2Mbit bwlimit
[22:15:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:17:03] <mutante>	 !log phab1004 - rsyncing /srv/repos from phab1001 with 2Mbit bwlimit - pulling - rsync -avp --bwlimit=2m --delete rsync://phab1001.eqiad.wmnet/srv-repos/ /srv/repos/ -  T280597 
[22:17:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:17:09] <stashbot>	 T280597: move phabricator to new hardware generation - https://phabricator.wikimedia.org/T280597
[22:17:57] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: Q2:rack/setup/install puppetdb1003 - https://phabricator.wikimedia.org/T317892 (10Papaul)
[22:19:16] <wikibugs>	 10SRE, 10ops-codfw: Troubleshoot why latest idrac version is not working on Dell servers - https://phabricator.wikimedia.org/T322419 (10Papaul) 05Open→03Resolved This is now fixed by @jbond and @Volans
[22:19:29] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2102.codfw.wmnet with reason: Maintenance
[22:19:53] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2102.codfw.wmnet with reason: Maintenance
[22:22:08] <icinga-wm>	 PROBLEM - Ganeti memory on ganeti1023 is CRITICAL: CRIT Memory 95% used. Largest process: qemu-system-x86 (2179388) = 23.0% https://wikitech.wikimedia.org/wiki/Ganeti%23Memory_pressure
[22:24:03] <mutante>	 !log temp disabling puppet on 17 hosts using rsync::quickdatacopy to carefully deploy gerrit:715636 allowing multiple dest hosts for syncing
[22:24:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:24:16] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Persistence-Backup: Q2:rack/setup/install dbprov1004 - https://phabricator.wikimedia.org/T321122 (Papaul)
[22:30:05] <wikibugs>	 (PS1) Papaul: Add puppetdb1003 and dbprov1004 to site.pp and netboox.cfg [puppet] - https://gerrit.wikimedia.org/r/859625 (https://phabricator.wikimedia.org/T317892)
[22:30:17] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2103.codfw.wmnet with reason: Maintenance
[22:30:24] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['dbprov1004']
[22:30:41] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2103.codfw.wmnet with reason: Maintenance
[22:30:48] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2103 (T321130)', diff saved to https://phabricator.wikimedia.org/P40691 and previous config saved to /var/cache/conftool/dbconfig/20221122-223047-marostegui.json
[22:30:53] <stashbot>	 T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130
[22:30:56] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['dbprov1004']
[22:31:44] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['dbprov1004']
[22:32:24] <wikibugs>	 (CR) Papaul: [C: +2] Add puppetdb1003 and dbprov1004 to site.pp and netboox.cfg [puppet] - https://gerrit.wikimedia.org/r/859625 (https://phabricator.wikimedia.org/T317892) (owner: Papaul)
[22:34:31] <mutante>	 !log phabricator: on phab1001 user 'phd' is UID 497, on phab1004 user 'phd' is UID 920 (this is desired and a fix!). But because UID 497 was now free, it became the UID of user 'vcs' on phab1004, while on phab1001 user 'vcs' is UID 498. So we use "find /srv/repos -uid 497 -exec chown phd {} \;" to give files owned by 497 to phd. T280597
[22:34:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:34:36] <stashbot>	 T280597: move phabricator to new hardware generation - https://phabricator.wikimedia.org/T280597
[22:36:40] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host puppetdb1003.eqiad.wmnet with OS bullseye
[22:36:47] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Patch-For-Review: Q2:rack/setup/install puppetdb1003 - https://phabricator.wikimedia.org/T317892 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host puppetdb1003.eqiad.wmnet with OS bullseye
[22:37:00] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Patch-For-Review: Q2:rack/setup/install puppetdb1003 - https://phabricator.wikimedia.org/T317892 (Papaul)
[22:37:15] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Persistence-Backup, Patch-For-Review: Q2:rack/setup/install dbprov1004 - https://phabricator.wikimedia.org/T321122 (Papaul)
[22:37:58] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['dbprov1004']
[22:38:50] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['dbprov1004']
[22:39:59] <jinxer-wm>	 (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[22:43:22] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2103 (T321130)', diff saved to https://phabricator.wikimedia.org/P40692 and previous config saved to /var/cache/conftool/dbconfig/20221122-224321-marostegui.json
[22:43:28] <stashbot>	 T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130
[22:44:59] <jinxer-wm>	 (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[22:48:40] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on puppetdb1003.eqiad.wmnet with reason: host reimage
[22:52:20] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on puppetdb1003.eqiad.wmnet with reason: host reimage
[22:54:59] <jinxer-wm>	 (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[22:58:28] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2103', diff saved to https://phabricator.wikimedia.org/P40693 and previous config saved to /var/cache/conftool/dbconfig/20221122-225828-marostegui.json
[22:59:22] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['dbprov1004']
[22:59:59] <jinxer-wm>	 (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[23:02:42] <wikibugs>	 (CR) Dzahn: [V: +1 C: +2] "compiled on all 17 hosts that use this (list to paste into compiler from cumin command: sudo cumin --no-colors 'R:rsync::quickdatacopy' 2>" [puppet] - https://gerrit.wikimedia.org/r/715636 (owner: Legoktm)
[23:06:54] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host puppetdb1003.eqiad.wmnet with OS bullseye
[23:07:01] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Patch-For-Review: Q2:rack/setup/install puppetdb1003 - https://phabricator.wikimedia.org/T317892 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host puppetdb1003.eqiad.wmnet with OS bullseye completed: - puppetdb1003 (**PASS**)...
[23:11:11] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Patch-For-Review: Q2:rack/setup/install puppetdb1003 - https://phabricator.wikimedia.org/T317892 (Papaul)
[23:11:30] <wikibugs>	 (CR) Dzahn: "change for multiple dest hosts - merged and deployed - unblocking you" [puppet] - https://gerrit.wikimedia.org/r/715638 (https://phabricator.wikimedia.org/T289857) (owner: Legoktm)
[23:12:41] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Patch-For-Review: Q2:rack/setup/install puppetdb1003 - https://phabricator.wikimedia.org/T317892 (Papaul) Open→Resolved @MoritzMuehlenhoff this is complete
[23:13:35] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2103', diff saved to https://phabricator.wikimedia.org/P40694 and previous config saved to /var/cache/conftool/dbconfig/20221122-231334-marostegui.json
[23:13:42] <wikibugs>	 (PS2) Dzahn: phabricator: set mysql master port for eqiad [puppet] - https://gerrit.wikimedia.org/r/859145 (https://phabricator.wikimedia.org/T280597)
[23:16:59] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host dbprov1004.eqiad.wmnet with OS bullseye
[23:17:06] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Persistence-Backup, Patch-For-Review: Q2:rack/setup/install dbprov1004 - https://phabricator.wikimedia.org/T321122 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host dbprov1004.eqiad.wmnet with OS bullseye
[23:17:38] <wikibugs>	 (PS1) Dzahn: phabricator: let phd run on phab1004 [puppet] - https://gerrit.wikimedia.org/r/859628 (https://phabricator.wikimedia.org/T280597)
[23:24:01] <wikibugs>	 (PS1) Dzahn: phabricator: move some more settings from host file to common [puppet] - https://gerrit.wikimedia.org/r/859631 (https://phabricator.wikimedia.org/T280597)
[23:28:41] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2103 (T321130)', diff saved to https://phabricator.wikimedia.org/P40695 and previous config saved to /var/cache/conftool/dbconfig/20221122-232841-marostegui.json
[23:28:43] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2116.codfw.wmnet with reason: Maintenance
[23:28:47] <stashbot>	 T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130
[23:28:57] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2116.codfw.wmnet with reason: Maintenance
[23:29:03] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2116 (T321130)', diff saved to https://phabricator.wikimedia.org/P40696 and previous config saved to /var/cache/conftool/dbconfig/20221122-232903-marostegui.json
[23:41:35] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2116 (T321130)', diff saved to https://phabricator.wikimedia.org/P40697 and previous config saved to /var/cache/conftool/dbconfig/20221122-234134-marostegui.json
[23:41:41] <stashbot>	 T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130
[23:44:59] <jinxer-wm>	 (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[23:50:11] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on dbprov1004.eqiad.wmnet with reason: host reimage
[23:53:21] <wikibugs>	 (PS1) Daimona Eaytoy: Create list of users who can test the CampaignEvents extension on meta [mediawiki-config] - https://gerrit.wikimedia.org/r/859634 (https://phabricator.wikimedia.org/T316227)
[23:53:37] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbprov1004.eqiad.wmnet with reason: host reimage
[23:54:59] <jinxer-wm>	 (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[23:56:41] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2116', diff saved to https://phabricator.wikimedia.org/P40698 and previous config saved to /var/cache/conftool/dbconfig/20221122-235641-marostegui.json
[23:57:01] <wikibugs>	 (PS1) Daimona Eaytoy: Configure the CampaignEvents ext to use the x1.wikishared db for meta [mediawiki-config] - https://gerrit.wikimedia.org/r/859635 (https://phabricator.wikimedia.org/T322745)
[23:58:16] <wikibugs>	 (PS1) Daimona Eaytoy: Enable the CampaignEvents extension on meta [mediawiki-config] - https://gerrit.wikimedia.org/r/859636 (https://phabricator.wikimedia.org/T322745)