[00:00:48] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.161 second response time https://wikitech.wikimedia.org/wiki/Swift
[00:01:45] <wikibugs>	 (PS1) Dzahn: scap: move firewall rules out of the module [puppet] - https://gerrit.wikimedia.org/r/862378 (https://phabricator.wikimedia.org/T114209)
[00:02:20] <wikibugs>	 (CR) CI reject: [V: -1] scap: move firewall rules out of the module [puppet] - https://gerrit.wikimedia.org/r/862378 (https://phabricator.wikimedia.org/T114209) (owner: Dzahn)
[00:04:36] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe1011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[00:04:59] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119', diff saved to https://phabricator.wikimedia.org/P41969 and previous config saved to /var/cache/conftool/dbconfig/20221201-000458-ladsgroup.json
[00:07:17] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1206.eqiad.wmnet with reason: host reimage
[00:07:53] <wikibugs>	 SRE: Check incoming requests to secure.wm.o - https://phabricator.wikimedia.org/T119274 (Dzahn) >>! In T119274#1861105, @Reedy wrote: > This was filed for T93531  Hey Reedy, let me respond 7 years later.  You can still search for it https://www.google.co.uk/search?q=site:secure.wikimedia.org but T93531 has b...
[00:08:24] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 451 bytes in 0.021 second response time https://wikitech.wikimedia.org/wiki/Swift
[00:10:42] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1206.eqiad.wmnet with reason: host reimage
[00:14:28] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T322618)', diff saved to https://phabricator.wikimedia.org/P41970 and previous config saved to /var/cache/conftool/dbconfig/20221201-001427-ladsgroup.json
[00:14:30] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2153.codfw.wmnet with reason: Maintenance
[00:14:35] <stashbot>	 T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618
[00:14:43] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2153.codfw.wmnet with reason: Maintenance
[00:14:50] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2153 (T322618)', diff saved to https://phabricator.wikimedia.org/P41971 and previous config saved to /var/cache/conftool/dbconfig/20221201-001449-ladsgroup.json
[00:17:00] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T322618)', diff saved to https://phabricator.wikimedia.org/P41972 and previous config saved to /var/cache/conftool/dbconfig/20221201-001659-ladsgroup.json
[00:19:09] <wikibugs>	 (PS1) Andrew Bogott: oslo_messaging_rabbit: increase retry and backoff by a lot [puppet] - https://gerrit.wikimedia.org/r/862389 (https://phabricator.wikimedia.org/T318816)
[00:19:55] <wikibugs>	 (PS2) Ssingh: cp5026: update site.pp and related configs for cp role [puppet] - https://gerrit.wikimedia.org/r/861914 (https://phabricator.wikimedia.org/T322048)
[00:20:05] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119', diff saved to https://phabricator.wikimedia.org/P41973 and previous config saved to /var/cache/conftool/dbconfig/20221201-002005-ladsgroup.json
[00:21:50] <wikibugs>	 (CR) Andrew Bogott: [C: +2] oslo_messaging_rabbit: increase retry and backoff by a lot [puppet] - https://gerrit.wikimedia.org/r/862389 (https://phabricator.wikimedia.org/T318816) (owner: Andrew Bogott)
[00:22:29] <wikibugs>	 SRE, serviceops-collab, serviceops-radar: Check incoming requests to secure.wm.o - https://phabricator.wikimedia.org/T119274 (Dzahn)
[00:23:33] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1206.eqiad.wmnet with OS bullseye
[00:23:39] <wikibugs>	 SRE, ops-eqiad, DBA, DC-Ops: Q3:rack/setup/install db1206 - https://phabricator.wikimedia.org/T322256 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db1206.eqiad.wmnet with OS bullseye completed: - db1206 (**PASS**)   - Removed from Puppet and Puppe...
[00:23:53] <wikibugs>	 SRE, ops-eqiad, DBA, DC-Ops: Q3:rack/setup/install db1206 - https://phabricator.wikimedia.org/T322256 (Papaul)
[00:24:09] <wikibugs>	 SRE, serviceops-collab, serviceops-radar: Check incoming requests to secure.wm.o - https://phabricator.wikimedia.org/T119274 (Dzahn) I went to https://superset.wikimedia.org and then tried the dashboard "Webrequest Sampled 128 | SRE" that @volans just showed us in an SRE presentation.  I filtered by...
[00:24:15] <wikibugs>	 SRE, ops-eqiad, DBA, DC-Ops: Q3:rack/setup/install db1206 - https://phabricator.wikimedia.org/T322256 (Papaul) Open→Resolved @Marostegui this is complete
[00:24:28] <wikibugs>	 (CR) Ssingh: [C: +2] cp5026: update site.pp and related configs for cp role [puppet] - https://gerrit.wikimedia.org/r/861914 (https://phabricator.wikimedia.org/T322048) (owner: Ssingh)
[00:25:15] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe1011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[00:25:36] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp5026.eqsin.wmnet with OS buster
[00:25:46] <wikibugs>	 SRE, ops-eqsin, DC-Ops, Traffic, Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp5026.eqsin.wmnet with OS buster
[00:26:27] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 451 bytes in 0.039 second response time https://wikitech.wikimedia.org/wiki/Swift
[00:26:30] <wikibugs>	 SRE, serviceops-collab, serviceops-radar: Check incoming requests to secure.wm.o - https://phabricator.wikimedia.org/T119274 (Dzahn) Open→Resolved a:Dzahn hits on `secure.wikimedia.org` from "1 month ago" until "now".  {F35826054}  Biggest referer is MIT by the way.  That being said, I th...
[00:26:33] <wikibugs>	 SRE, SEO: secure.wikimedia.org entries still showing up in Google search results - https://phabricator.wikimedia.org/T93531 (Dzahn)
[00:26:35] <wikibugs>	 SRE, Patch-For-Review: Remove secure.wikimedia.org - https://phabricator.wikimedia.org/T120790 (Dzahn)
[00:27:38] <wikibugs>	 SRE, SEO: secure.wikimedia.org entries still showing up in Google search results - https://phabricator.wikimedia.org/T93531 (Dzahn) T119274#8434032
[00:29:06] <wikibugs>	 SRE, Release-Engineering-Team, serviceops-collab: Redirect revisions from svn.wikimedia.org to https://phabricator.wikimedia.org/rSVN - https://phabricator.wikimedia.org/T119846 (Dzahn)
[00:29:35] <wikibugs>	 SRE, Release-Engineering-Team, serviceops-collab: Redirect revisions from svn.wikimedia.org to https://phabricator.wikimedia.org/rSVN - https://phabricator.wikimedia.org/T119846 (Dzahn) I think we should either decline this OR redirect to gitlab OR to gerrit, just definitely not to Phabricator anymore.
[00:30:41] <wikibugs>	 SRE, Deployments, Infrastructure-Foundations, serviceops-radar: Make l10nupdate user a system user - https://phabricator.wikimedia.org/T120585 (Dzahn)
[00:31:03] <wikibugs>	 SRE, Deployments, Infrastructure-Foundations, serviceops-radar: Make l10nupdate user a system user - https://phabricator.wikimedia.org/T120585 (Dzahn) Probably this means it should be created with `systemd::sysuser` in puppet nowadays.
[00:32:06] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P41974 and previous config saved to /var/cache/conftool/dbconfig/20221201-003205-ladsgroup.json
[00:32:14] <wikibugs>	 SRE, Infrastructure-Foundations, User-MoritzMuehlenhoff: system users with UIDs > 500 - https://phabricator.wikimedia.org/T121610 (Dzahn)
[00:34:22] <wikibugs>	 SRE, Infrastructure-Foundations, User-MoritzMuehlenhoff: system users with UIDs > 500 - https://phabricator.wikimedia.org/T121610 (Dzahn) This is old but I think it can be translated to "create all system users with systemd::sysuser in puppet" nowadays.
[00:35:12] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119 (T322618)', diff saved to https://phabricator.wikimedia.org/P41975 and previous config saved to /var/cache/conftool/dbconfig/20221201-003511-ladsgroup.json
[00:35:13] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1128.eqiad.wmnet with reason: Maintenance
[00:35:19] <stashbot>	 T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618
[00:35:27] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1128.eqiad.wmnet with reason: Maintenance
[00:35:33] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1128 (T322618)', diff saved to https://phabricator.wikimedia.org/P41976 and previous config saved to /var/cache/conftool/dbconfig/20221201-003533-ladsgroup.json
[00:39:41] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128 (T322618)', diff saved to https://phabricator.wikimedia.org/P41977 and previous config saved to /var/cache/conftool/dbconfig/20221201-003941-ladsgroup.json
[00:40:24] <icinga-wm>	 PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: WARNING:urllib3.connectionpool:Retrying (Retry(total=0, connect=None, read=None, redirect=0, status=None)) after connection broken by ProtocolError(Connection aborted., ConnectionResetError(104, Connection reset by peer)): /en.wikipedia.org/v1/page/featured/2016/04/29 https://wikitech.wikimedia.org/wiki/Wikifeeds
[00:40:50] <wikibugs>	 (PS2) Ssingh: cp5027: update site.pp and related configs for cp role [puppet] - https://gerrit.wikimedia.org/r/861915 (https://phabricator.wikimedia.org/T322048)
[00:42:00] <icinga-wm>	 RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[00:42:24] <wikibugs>	 Puppet, SRE, Infrastructure-Foundations, Documentation, and 2 others: document all puppet classes / defined types!? - https://phabricator.wikimedia.org/T127797 (Dzahn) It's probably unrealistic to see this ticket closed as resolved ever.   We could close it and I would be fine with that or we can...
[00:43:44] <wikibugs>	 SRE, Diffusion, Release-Engineering-Team, serviceops-collab: svn.wikimedia.org redirects to Diffusion main page, hence hard to find e.g. "flexbisonparse" - https://phabricator.wikimedia.org/T140594 (Dzahn)
[00:44:12] <wikibugs>	 SRE, Traffic-Icebox, Wikimedia-Planet, serviceops-collab, Patch-For-Review: mixed-content issues on planet.wikimedia.org - https://phabricator.wikimedia.org/T141480 (Dzahn)
[00:45:31] <wikibugs>	 SRE, Infrastructure-Foundations, Release-Engineering-Team: Enforce reference to Phabricator task for all commits to modules/admin/data/data.yaml - https://phabricator.wikimedia.org/T142827 (Dzahn)
[00:47:13] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P41978 and previous config saved to /var/cache/conftool/dbconfig/20221201-004712-ladsgroup.json
[00:48:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:48:50] <wikibugs>	 SRE, Cloud-Services, Domains, Education-Program-Dashboard, Traffic-Icebox: Create short link for outreachdashboard.wmflabs.org - https://phabricator.wikimedia.org/T146332 (Dzahn)
[00:50:12] <icinga-wm>	 PROBLEM - rsyslog TLS listener on port 6514 on centrallog2002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Logs
[00:50:31] <wikibugs>	 (PS4) Eevans: Promote the aqs_next role to be aqs [puppet] - https://gerrit.wikimedia.org/r/859059 (https://phabricator.wikimedia.org/T302278) (owner: Btullis)
[00:50:59] <jinxer-wm>	 (KubernetesAPILatency) firing: (9) High Kubernetes API latency (LIST certificates) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[00:51:12] <wikibugs>	 (CR) CI reject: [V: -1] Promote the aqs_next role to be aqs [puppet] - https://gerrit.wikimedia.org/r/859059 (https://phabricator.wikimedia.org/T302278) (owner: Btullis)
[00:51:15] <wikibugs>	 SRE, Cloud-Services, Domains, Education-Program-Dashboard, Traffic-Icebox: Create short link for outreachdashboard.wmflabs.org - https://phabricator.wikimedia.org/T146332 (Dzahn) status here as of today is:  https://dash.wmflabs.org/ exists but shows an error because no proxy is configured  h...
[00:51:50] <icinga-wm>	 RECOVERY - rsyslog TLS listener on port 6514 on centrallog2002 is OK: SSL OK - Certificate centrallog2002.codfw.wmnet valid until 2026-09-27 13:35:26 +0000 (expires in 1396 days) https://wikitech.wikimedia.org/wiki/Logs
[00:52:11] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[00:52:46] <wikibugs>	 SRE, Cloud-Services, Domains, Education-Program-Dashboard, Traffic-Icebox: Create short link for outreachdashboard.wmflabs.org - https://phabricator.wikimedia.org/T146332 (Dzahn) I added Cloud-Service because it seems to me this needs an admin from https://openstack-browser.toolforge.org/proj...
[00:53:02] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp5026.eqsin.wmnet with reason: host reimage
[00:53:03] <jinxer-wm>	 (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:54:36] <icinga-wm>	 PROBLEM - Check systemd state on kubernetes2014 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:54:48] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128', diff saved to https://phabricator.wikimedia.org/P41979 and previous config saved to /var/cache/conftool/dbconfig/20221201-005447-ladsgroup.json
[00:55:20] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe1011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.108 second response time https://wikitech.wikimedia.org/wiki/Swift
[00:55:25] <wikibugs>	 SRE, Release-Engineering-Team, serviceops-collab: Redirect revisions from svn.wikimedia.org to https://phabricator.wikimedia.org/rSVN - https://phabricator.wikimedia.org/T119846 (bd808) >>! In T119846#8434042, @Dzahn wrote: > I think we should either decline this OR redirect to gitlab OR to gerrit, j...
[00:55:42] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.010 second response time https://wikitech.wikimedia.org/wiki/Swift
[00:55:55] <wikibugs>	 Puppet, SRE, Infrastructure-Foundations: Fix UIDs for deployment server users - https://phabricator.wikimedia.org/T163667 (Dzahn) This is yet another one where I would call it resolved once the user is created by systemd::sysuser.
[00:56:36] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5026.eqsin.wmnet with reason: host reimage
[00:56:53] <wikibugs>	 Puppet, SRE, Infrastructure-Foundations, serviceops-radar: Fix UIDs for deployment server users - https://phabricator.wikimedia.org/T163667 (Dzahn)
[00:57:38] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes2014 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[00:59:42] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 451 bytes in 0.022 second response time https://wikitech.wikimedia.org/wiki/Swift
[01:00:44] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on ml-serve2008 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[01:01:24] <icinga-wm>	 PROBLEM - Check systemd state on ml-serve2008 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:02:19] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T322618)', diff saved to https://phabricator.wikimedia.org/P41980 and previous config saved to /var/cache/conftool/dbconfig/20221201-010219-ladsgroup.json
[01:02:21] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2167.codfw.wmnet with reason: Maintenance
[01:02:27] <stashbot>	 T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618
[01:02:34] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2167.codfw.wmnet with reason: Maintenance
[01:02:41] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2167:3311 (T322618)', diff saved to https://phabricator.wikimedia.org/P41981 and previous config saved to /var/cache/conftool/dbconfig/20221201-010240-ladsgroup.json
[01:04:16] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe1011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.186 second response time https://wikitech.wikimedia.org/wiki/Swift
[01:04:51] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311 (T322618)', diff saved to https://phabricator.wikimedia.org/P41982 and previous config saved to /var/cache/conftool/dbconfig/20221201-010450-ladsgroup.json
[01:08:48] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.136 second response time https://wikitech.wikimedia.org/wiki/Swift
[01:09:54] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128', diff saved to https://phabricator.wikimedia.org/P41983 and previous config saved to /var/cache/conftool/dbconfig/20221201-010954-ladsgroup.json
[01:11:28] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.008 second response time https://wikitech.wikimedia.org/wiki/Swift
[01:11:34] <icinga-wm>	 RECOVERY - Check systemd state on ml-serve2008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:16:22] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:17:52] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.150 second response time https://wikitech.wikimedia.org/wiki/Swift
[01:18:02] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:18:22] <icinga-wm>	 PROBLEM - Disk space on stat1004 is CRITICAL: DISK CRITICAL - free space: / 3538 MB (3% inode=80%): /tmp 3538 MB (3% inode=80%): /var/tmp 3538 MB (3% inode=80%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=stat1004&var-datasource=eqiad+prometheus/ops
[01:19:26] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.008 second response time https://wikitech.wikimedia.org/wiki/Swift
[01:19:57] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311', diff saved to https://phabricator.wikimedia.org/P41984 and previous config saved to /var/cache/conftool/dbconfig/20221201-011957-ladsgroup.json
[01:20:06] <icinga-wm>	 RECOVERY - Check systemd state on kubernetes2014 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:21:54] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 451 bytes in 0.024 second response time https://wikitech.wikimedia.org/wiki/Swift
[01:24:56] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp5026.eqsin.wmnet with OS buster
[01:25:01] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128 (T322618)', diff saved to https://phabricator.wikimedia.org/P41985 and previous config saved to /var/cache/conftool/dbconfig/20221201-012500-ladsgroup.json
[01:25:03] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1132.eqiad.wmnet with reason: Maintenance
[01:25:05] <wikibugs>	 SRE, ops-eqsin, DC-Ops, Traffic, Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp5026.eqsin.wmnet with OS buster completed: - cp5026 (**PASS**)   -...
[01:25:09] <stashbot>	 T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618
[01:25:16] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1132.eqiad.wmnet with reason: Maintenance
[01:25:23] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1132 (T322618)', diff saved to https://phabricator.wikimedia.org/P41986 and previous config saved to /var/cache/conftool/dbconfig/20221201-012522-ladsgroup.json
[01:26:10] <wikibugs>	 (CR) Wugapodes: Add ContactPage and ArbCom form to EnWiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/860946 (https://phabricator.wikimedia.org/T321447) (owner: Wugapodes)
[01:26:18] <wikibugs>	 (CR) Ssingh: [C: +2] cp5027: update site.pp and related configs for cp role [puppet] - https://gerrit.wikimedia.org/r/861915 (https://phabricator.wikimedia.org/T322048) (owner: Ssingh)
[01:26:30] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132 (T322618)', diff saved to https://phabricator.wikimedia.org/P41987 and previous config saved to /var/cache/conftool/dbconfig/20221201-012630-ladsgroup.json
[01:27:01] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp5027.eqsin.wmnet with OS buster
[01:27:11] <wikibugs>	 SRE, ops-eqsin, DC-Ops, Traffic, Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp5027.eqsin.wmnet with OS buster
[01:27:50] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe1011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.107 second response time https://wikitech.wikimedia.org/wiki/Swift
[01:28:38] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes2014 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[01:30:30] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.294 second response time https://wikitech.wikimedia.org/wiki/Swift
[01:30:44] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on ml-serve2008 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[01:35:04] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311', diff saved to https://phabricator.wikimedia.org/P41988 and previous config saved to /var/cache/conftool/dbconfig/20221201-013503-ladsgroup.json
[01:37:45] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job redis_gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:41:37] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132', diff saved to https://phabricator.wikimedia.org/P41989 and previous config saved to /var/cache/conftool/dbconfig/20221201-014136-ladsgroup.json
[01:42:45] <jinxer-wm>	 (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:43:04] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.012 second response time https://wikitech.wikimedia.org/wiki/Swift
[01:47:36] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[01:49:00] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.011 second response time https://wikitech.wikimedia.org/wiki/Swift
[01:50:10] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311 (T322618)', diff saved to https://phabricator.wikimedia.org/P41990 and previous config saved to /var/cache/conftool/dbconfig/20221201-015010-ladsgroup.json
[01:50:12] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2170.codfw.wmnet with reason: Maintenance
[01:50:14] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2170.codfw.wmnet with reason: Maintenance
[01:50:18] <stashbot>	 T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618
[01:50:21] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2170:3311 (T322618)', diff saved to https://phabricator.wikimedia.org/P41991 and previous config saved to /var/cache/conftool/dbconfig/20221201-015020-ladsgroup.json
[01:50:44] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1146.eqiad.wmnet with reason: Maintenance
[01:51:09] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1146.eqiad.wmnet with reason: Maintenance
[01:51:16] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3314 (T318605)', diff saved to https://phabricator.wikimedia.org/P41992 and previous config saved to /var/cache/conftool/dbconfig/20221201-015115-ladsgroup.json
[01:51:22] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[01:52:31] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311 (T322618)', diff saved to https://phabricator.wikimedia.org/P41993 and previous config saved to /var/cache/conftool/dbconfig/20221201-015230-ladsgroup.json
[01:52:45] <jinxer-wm>	 (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:53:13] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1096.eqiad.wmnet with reason: Maintenance
[01:53:21] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2117.codfw.wmnet with reason: Maintenance
[01:53:26] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1096.eqiad.wmnet with reason: Maintenance
[01:53:33] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1096:3316 (T318605)', diff saved to https://phabricator.wikimedia.org/P41994 and previous config saved to /var/cache/conftool/dbconfig/20221201-015332-ladsgroup.json
[01:53:34] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2117.codfw.wmnet with reason: Maintenance
[01:53:41] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2117 (T318605)', diff saved to https://phabricator.wikimedia.org/P41995 and previous config saved to /var/cache/conftool/dbconfig/20221201-015340-ladsgroup.json
[01:54:16] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[01:55:08] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox
[01:55:51] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117 (T318605)', diff saved to https://phabricator.wikimedia.org/P41996 and previous config saved to /var/cache/conftool/dbconfig/20221201-015550-ladsgroup.json
[01:56:44] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132', diff saved to https://phabricator.wikimedia.org/P41997 and previous config saved to /var/cache/conftool/dbconfig/20221201-015643-ladsgroup.json
[01:57:52] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.008 second response time https://wikitech.wikimedia.org/wiki/Swift
[01:58:01] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cephosd - cmjohnson@cumin1001"
[01:59:18] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cephosd - cmjohnson@cumin1001"
[01:59:19] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[02:00:01] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering-Planning, Shared-Data-Infrastructure: Q2:rack/setup/install cephosd100[1-5] - https://phabricator.wikimedia.org/T322760 (Cmjohnson)
[02:03:00] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1096.eqiad.wmnet with reason: Maintenance
[02:03:02] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1096.eqiad.wmnet with reason: Maintenance
[02:03:09] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1096:3315 (T323907)', diff saved to https://phabricator.wikimedia.org/P41998 and previous config saved to /var/cache/conftool/dbconfig/20221201-020308-ladsgroup.json
[02:03:10] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2101.codfw.wmnet with reason: Maintenance
[02:03:18] <stashbot>	 T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907
[02:03:23] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2101.codfw.wmnet with reason: Maintenance
[02:04:00] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.175 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:04:14] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 451 bytes in 0.020 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:05:56] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.008 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:07:38] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311', diff saved to https://phabricator.wikimedia.org/P41999 and previous config saved to /var/cache/conftool/dbconfig/20221201-020737-ladsgroup.json
[02:07:45] <jinxer-wm>	 (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:08:11] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox
[02:09:37] <logmsgbot>	 !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[02:09:44] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox
[02:10:24] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe1011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.118 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:10:57] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117', diff saved to https://phabricator.wikimedia.org/P42000 and previous config saved to /var/cache/conftool/dbconfig/20221201-021057-ladsgroup.json
[02:11:50] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132 (T322618)', diff saved to https://phabricator.wikimedia.org/P42001 and previous config saved to /var/cache/conftool/dbconfig/20221201-021149-ladsgroup.json
[02:11:51] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1134.eqiad.wmnet with reason: Maintenance
[02:11:54] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: an-coord - cmjohnson@cumin1001"
[02:11:57] <stashbot>	 T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618
[02:12:05] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1134.eqiad.wmnet with reason: Maintenance
[02:12:12] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1134 (T322618)', diff saved to https://phabricator.wikimedia.org/P42002 and previous config saved to /var/cache/conftool/dbconfig/20221201-021211-ladsgroup.json
[02:12:57] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: an-coord - cmjohnson@cumin1001"
[02:12:57] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[02:13:19] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T322618)', diff saved to https://phabricator.wikimedia.org/P42003 and previous config saved to /var/cache/conftool/dbconfig/20221201-021318-ladsgroup.json
[02:14:14] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[02:16:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:17:45] <jinxer-wm>	 (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:18:29] <wikibugs>	 SRE, ops-eqiad, DBA, DC-Ops: Q3:rack/setup/install db1206 - https://phabricator.wikimedia.org/T322256 (Ladsgroup) Thanks Papaul!  @Marostegui: When provisioning this for production, I'd really appreciate if I can shadow you to learn how we add a db to rotation. Please 🥺
[02:20:30] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 451 bytes in 0.021 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:20:59] <jinxer-wm>	 (KubernetesAPILatency) firing: (10) High Kubernetes API latency (LIST certificates) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[02:20:59] <logmsgbot>	 !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp5027.eqsin.wmnet with OS buster
[02:21:03] <jinxer-wm>	 (ProbeDown) firing: (4) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:21:09] <wikibugs>	 SRE, ops-eqsin, DC-Ops, Traffic, Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp5027.eqsin.wmnet with OS buster executed with errors: - cp5027 (**...
[02:21:26] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp5027.eqsin.wmnet with OS buster
[02:21:37] <wikibugs>	 SRE, ops-eqsin, DC-Ops, Traffic, Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp5027.eqsin.wmnet with OS buster
[02:21:52] <logmsgbot>	 !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp5027.eqsin.wmnet with OS buster
[02:22:01] <wikibugs>	 SRE, ops-eqsin, DC-Ops, Traffic, Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp5027.eqsin.wmnet with OS buster executed with errors: - cp5027 (**...
[02:22:16] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp5027.eqsin.wmnet with OS buster
[02:22:25] <wikibugs>	 SRE, ops-eqsin, DC-Ops, Traffic, Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp5027.eqsin.wmnet with OS buster
[02:22:44] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311', diff saved to https://phabricator.wikimedia.org/P42004 and previous config saved to /var/cache/conftool/dbconfig/20221201-022244-ladsgroup.json
[02:22:45] <jinxer-wm>	 (JobUnavailable) resolved: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:26:03] <jinxer-wm>	 (ProbeDown) resolved: (4) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:26:04] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117', diff saved to https://phabricator.wikimedia.org/P42005 and previous config saved to /var/cache/conftool/dbconfig/20221201-022603-ladsgroup.json
[02:26:36] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe1011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.248 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:27:07] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering-Planning, Shared-Data-Infrastructure: Q2:rack/setup/install an-coord100[3,4] & an-mariadb100[1,2] - https://phabricator.wikimedia.org/T321119 (Cmjohnson)
[02:28:25] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P42006 and previous config saved to /var/cache/conftool/dbconfig/20221201-022825-ladsgroup.json
[02:30:27] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316 (T318605)', diff saved to https://phabricator.wikimedia.org/P42007 and previous config saved to /var/cache/conftool/dbconfig/20221201-023027-ladsgroup.json
[02:30:34] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[02:32:45] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host druid1009.mgmt.eqiad.wmnet with reboot policy FORCED
[02:33:21] <logmsgbot>	 !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp5027.eqsin.wmnet with OS buster
[02:33:30] <wikibugs>	 SRE, ops-eqsin, DC-Ops, Traffic, Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp5027.eqsin.wmnet with OS buster executed with errors: - cp5027 (**...
[02:33:48] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp5027.eqsin.wmnet with OS buster
[02:35:59] <jinxer-wm>	 (KubernetesAPILatency) firing: (12) High Kubernetes API latency (LIST certificates) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[02:37:51] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311 (T322618)', diff saved to https://phabricator.wikimedia.org/P42008 and previous config saved to /var/cache/conftool/dbconfig/20221201-023750-ladsgroup.json
[02:37:53] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2174.codfw.wmnet with reason: Maintenance
[02:37:55] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2174.codfw.wmnet with reason: Maintenance
[02:37:58] <stashbot>	 T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618
[02:38:01] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2174 (T322618)', diff saved to https://phabricator.wikimedia.org/P42009 and previous config saved to /var/cache/conftool/dbconfig/20221201-023801-ladsgroup.json
[02:40:12] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T322618)', diff saved to https://phabricator.wikimedia.org/P42010 and previous config saved to /var/cache/conftool/dbconfig/20221201-024011-ladsgroup.json
[02:40:59] <jinxer-wm>	 (KubernetesAPILatency) firing: (13) High Kubernetes API latency (LIST certificates) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[02:41:10] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117 (T318605)', diff saved to https://phabricator.wikimedia.org/P42011 and previous config saved to /var/cache/conftool/dbconfig/20221201-024110-ladsgroup.json
[02:41:12] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2124.codfw.wmnet with reason: Maintenance
[02:41:17] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[02:41:25] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2124.codfw.wmnet with reason: Maintenance
[02:41:32] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2124 (T318605)', diff saved to https://phabricator.wikimedia.org/P42012 and previous config saved to /var/cache/conftool/dbconfig/20221201-024131-ladsgroup.json
[02:43:32] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P42013 and previous config saved to /var/cache/conftool/dbconfig/20221201-024331-ladsgroup.json
[02:43:42] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124 (T318605)', diff saved to https://phabricator.wikimedia.org/P42014 and previous config saved to /var/cache/conftool/dbconfig/20221201-024341-ladsgroup.json
[02:45:34] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316', diff saved to https://phabricator.wikimedia.org/P42015 and previous config saved to /var/cache/conftool/dbconfig/20221201-024533-ladsgroup.json
[02:48:57] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.007 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:53:55] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.300 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:55:18] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P42016 and previous config saved to /var/cache/conftool/dbconfig/20221201-025517-ladsgroup.json
[02:55:33] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.010 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:57:43] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 451 bytes in 0.033 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:58:44] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T322618)', diff saved to https://phabricator.wikimedia.org/P42017 and previous config saved to /var/cache/conftool/dbconfig/20221201-025838-ladsgroup.json
[02:58:44] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1135.eqiad.wmnet with reason: Maintenance
[02:58:49] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124', diff saved to https://phabricator.wikimedia.org/P42018 and previous config saved to /var/cache/conftool/dbconfig/20221201-025848-ladsgroup.json
[02:58:51] <stashbot>	 T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618
[02:58:54] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1135.eqiad.wmnet with reason: Maintenance
[02:59:00] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1135 (T322618)', diff saved to https://phabricator.wikimedia.org/P42019 and previous config saved to /var/cache/conftool/dbconfig/20221201-025900-ladsgroup.json
[03:00:08] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T322618)', diff saved to https://phabricator.wikimedia.org/P42020 and previous config saved to /var/cache/conftool/dbconfig/20221201-030007-ladsgroup.json
[03:00:41] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316', diff saved to https://phabricator.wikimedia.org/P42021 and previous config saved to /var/cache/conftool/dbconfig/20221201-030040-ladsgroup.json
[03:01:23] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.163 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:03:22] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp5027.eqsin.wmnet with reason: host reimage
[03:03:45] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe1011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.231 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:05:39] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 451 bytes in 0.035 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:06:49] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5027.eqsin.wmnet with reason: host reimage
[03:09:22] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe1011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.276 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:10:25] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P42022 and previous config saved to /var/cache/conftool/dbconfig/20221201-031024-ladsgroup.json
[03:12:22] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 451 bytes in 0.023 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:13:55] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124', diff saved to https://phabricator.wikimedia.org/P42023 and previous config saved to /var/cache/conftool/dbconfig/20221201-031354-ladsgroup.json
[03:15:14] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P42024 and previous config saved to /var/cache/conftool/dbconfig/20221201-031514-ladsgroup.json
[03:15:47] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316 (T318605)', diff saved to https://phabricator.wikimedia.org/P42025 and previous config saved to /var/cache/conftool/dbconfig/20221201-031546-ladsgroup.json
[03:15:48] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1098.eqiad.wmnet with reason: Maintenance
[03:15:52] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe1011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.138 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:15:54] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[03:16:02] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1098.eqiad.wmnet with reason: Maintenance
[03:16:08] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3316 (T318605)', diff saved to https://phabricator.wikimedia.org/P42026 and previous config saved to /var/cache/conftool/dbconfig/20221201-031608-ladsgroup.json
[03:18:14] <icinga-wm>	 PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:25:31] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T322618)', diff saved to https://phabricator.wikimedia.org/P42027 and previous config saved to /var/cache/conftool/dbconfig/20221201-032531-ladsgroup.json
[03:25:33] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2176.codfw.wmnet with reason: Maintenance
[03:25:40] <stashbot>	 T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618
[03:25:47] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2176.codfw.wmnet with reason: Maintenance
[03:25:53] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2176 (T322618)', diff saved to https://phabricator.wikimedia.org/P42028 and previous config saved to /var/cache/conftool/dbconfig/20221201-032553-ladsgroup.json
[03:28:03] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T322618)', diff saved to https://phabricator.wikimedia.org/P42029 and previous config saved to /var/cache/conftool/dbconfig/20221201-032803-ladsgroup.json
[03:29:01] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124 (T318605)', diff saved to https://phabricator.wikimedia.org/P42030 and previous config saved to /var/cache/conftool/dbconfig/20221201-032901-ladsgroup.json
[03:29:03] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2129.codfw.wmnet with reason: Maintenance
[03:29:08] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[03:29:10] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.009 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:29:16] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2129.codfw.wmnet with reason: Maintenance
[03:29:23] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2129 (T318605)', diff saved to https://phabricator.wikimedia.org/P42031 and previous config saved to /var/cache/conftool/dbconfig/20221201-032922-ladsgroup.json
[03:30:21] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P42032 and previous config saved to /var/cache/conftool/dbconfig/20221201-033020-ladsgroup.json
[03:31:33] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2129 (T318605)', diff saved to https://phabricator.wikimedia.org/P42033 and previous config saved to /var/cache/conftool/dbconfig/20221201-033132-ladsgroup.json
[03:34:29] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2111.codfw.wmnet with reason: Maintenance
[03:34:43] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2111.codfw.wmnet with reason: Maintenance
[03:34:49] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2111 (T323907)', diff saved to https://phabricator.wikimedia.org/P42034 and previous config saved to /var/cache/conftool/dbconfig/20221201-033449-ladsgroup.json
[03:34:56] <stashbot>	 T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907
[03:35:17] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp5027.eqsin.wmnet with OS buster
[03:36:44] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.189 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:37:10] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315 (T323907)', diff saved to https://phabricator.wikimedia.org/P42035 and previous config saved to /var/cache/conftool/dbconfig/20221201-033710-ladsgroup.json
[03:40:44] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.008 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:43:10] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P42036 and previous config saved to /var/cache/conftool/dbconfig/20221201-034309-ladsgroup.json
[03:44:38] <wikibugs>	 (PS1) Andrew Bogott: Add partman rules for cloudvirt10[54-61] [puppet] - https://gerrit.wikimedia.org/r/862406 (https://phabricator.wikimedia.org/T313983)
[03:45:27] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T322618)', diff saved to https://phabricator.wikimedia.org/P42037 and previous config saved to /var/cache/conftool/dbconfig/20221201-034527-ladsgroup.json
[03:45:29] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance
[03:45:35] <stashbot>	 T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618
[03:45:46] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Patch-For-Review, cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (Andrew)
[03:45:53] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance
[03:45:56] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1169.eqiad.wmnet with reason: Maintenance
[03:46:20] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1169.eqiad.wmnet with reason: Maintenance
[03:46:27] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1169 (T322618)', diff saved to https://phabricator.wikimedia.org/P42038 and previous config saved to /var/cache/conftool/dbconfig/20221201-034627-ladsgroup.json
[03:46:39] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2129', diff saved to https://phabricator.wikimedia.org/P42039 and previous config saved to /var/cache/conftool/dbconfig/20221201-034639-ladsgroup.json
[03:47:00] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.118 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:47:35] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T322618)', diff saved to https://phabricator.wikimedia.org/P42040 and previous config saved to /var/cache/conftool/dbconfig/20221201-034734-ladsgroup.json
[03:48:54] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.010 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:52:17] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315', diff saved to https://phabricator.wikimedia.org/P42041 and previous config saved to /var/cache/conftool/dbconfig/20221201-035216-ladsgroup.json
[03:55:06] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[03:55:12] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 (T318605)', diff saved to https://phabricator.wikimedia.org/P42042 and previous config saved to /var/cache/conftool/dbconfig/20221201-035512-ladsgroup.json
[03:55:20] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[03:56:46] <icinga-wm>	 PROBLEM - Wikitech-static main page has content on cloudweb1004 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki configuration Error - string Wikitech not found on https://wikitech-static.wikimedia.org:443/wiki/Main_Page?debug=true - 1652 bytes in 0.095 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[03:57:56] <icinga-wm>	 PROBLEM - Wikitech-static main page has content on cloudweb1003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki configuration Error - string Wikitech not found on https://wikitech-static.wikimedia.org:443/wiki/Main_Page?debug=true - 1652 bytes in 0.095 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[03:58:16] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P42043 and previous config saved to /var/cache/conftool/dbconfig/20221201-035816-ladsgroup.json
[04:01:46] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2129', diff saved to https://phabricator.wikimedia.org/P42044 and previous config saved to /var/cache/conftool/dbconfig/20221201-040145-ladsgroup.json
[04:02:41] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P42045 and previous config saved to /var/cache/conftool/dbconfig/20221201-040240-ladsgroup.json
[04:06:37] <icinga-wm>	 RECOVERY - Wikitech-static main page has content on cloudweb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 26020 bytes in 1.628 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[04:07:23] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315', diff saved to https://phabricator.wikimedia.org/P42046 and previous config saved to /var/cache/conftool/dbconfig/20221201-040723-ladsgroup.json
[04:10:19] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P42047 and previous config saved to /var/cache/conftool/dbconfig/20221201-041018-ladsgroup.json
[04:13:17] <icinga-wm>	 RECOVERY - Wikitech-static main page has content on cloudweb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 26018 bytes in 0.201 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[04:13:23] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T322618)', diff saved to https://phabricator.wikimedia.org/P42048 and previous config saved to /var/cache/conftool/dbconfig/20221201-041322-ladsgroup.json
[04:13:30] <stashbot>	 T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618
[04:13:45] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 451 bytes in 0.023 second response time https://wikitech.wikimedia.org/wiki/Swift
[04:16:52] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2129 (T318605)', diff saved to https://phabricator.wikimedia.org/P42049 and previous config saved to /var/cache/conftool/dbconfig/20221201-041652-ladsgroup.json
[04:16:54] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2141.codfw.wmnet with reason: Maintenance
[04:16:59] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[04:17:19] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2141.codfw.wmnet with reason: Maintenance
[04:17:23] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2158.codfw.wmnet with reason: Maintenance
[04:17:48] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P42050 and previous config saved to /var/cache/conftool/dbconfig/20221201-041747-ladsgroup.json
[04:17:48] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2158.codfw.wmnet with reason: Maintenance
[04:17:49] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2095.codfw.wmnet with reason: Maintenance
[04:17:52] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2095.codfw.wmnet with reason: Maintenance
[04:17:58] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2158 (T318605)', diff saved to https://phabricator.wikimedia.org/P42051 and previous config saved to /var/cache/conftool/dbconfig/20221201-041758-ladsgroup.json
[04:18:23] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe1011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[04:20:08] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T318605)', diff saved to https://phabricator.wikimedia.org/P42052 and previous config saved to /var/cache/conftool/dbconfig/20221201-042008-ladsgroup.json
[04:22:30] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315 (T323907)', diff saved to https://phabricator.wikimedia.org/P42053 and previous config saved to /var/cache/conftool/dbconfig/20221201-042229-ladsgroup.json
[04:22:31] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1100.eqiad.wmnet with reason: Maintenance
[04:22:37] <stashbot>	 T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907
[04:22:45] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1100.eqiad.wmnet with reason: Maintenance
[04:22:51] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1100 (T323907)', diff saved to https://phabricator.wikimedia.org/P42054 and previous config saved to /var/cache/conftool/dbconfig/20221201-042251-ladsgroup.json
[04:23:15] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 451 bytes in 0.035 second response time https://wikitech.wikimedia.org/wiki/Swift
[04:25:25] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P42055 and previous config saved to /var/cache/conftool/dbconfig/20221201-042525-ladsgroup.json
[04:32:54] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T322618)', diff saved to https://phabricator.wikimedia.org/P42056 and previous config saved to /var/cache/conftool/dbconfig/20221201-043253-ladsgroup.json
[04:32:55] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1184.eqiad.wmnet with reason: Maintenance
[04:33:02] <stashbot>	 T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618
[04:33:09] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1184.eqiad.wmnet with reason: Maintenance
[04:33:15] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1184 (T322618)', diff saved to https://phabricator.wikimedia.org/P42057 and previous config saved to /var/cache/conftool/dbconfig/20221201-043315-ladsgroup.json
[04:33:35] <icinga-wm>	 PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/feed/onthisday/{type}/{month}/{day} (retrieve selected events on January 15) is CRITICAL: Test retrieve selected events on January 15 returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds
[04:34:23] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T322618)', diff saved to https://phabricator.wikimedia.org/P42058 and previous config saved to /var/cache/conftool/dbconfig/20221201-043422-ladsgroup.json
[04:35:15] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P42059 and previous config saved to /var/cache/conftool/dbconfig/20221201-043514-ladsgroup.json
[04:37:11] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe1011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[04:39:45] <icinga-wm>	 RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[04:40:32] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 (T318605)', diff saved to https://phabricator.wikimedia.org/P42060 and previous config saved to /var/cache/conftool/dbconfig/20221201-044031-ladsgroup.json
[04:40:33] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1113.eqiad.wmnet with reason: Maintenance
[04:40:37] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.008 second response time https://wikitech.wikimedia.org/wiki/Swift
[04:40:39] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[04:40:47] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1113.eqiad.wmnet with reason: Maintenance
[04:40:53] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3316 (T318605)', diff saved to https://phabricator.wikimedia.org/P42061 and previous config saved to /var/cache/conftool/dbconfig/20221201-044053-ladsgroup.json
[04:49:01] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.139 second response time https://wikitech.wikimedia.org/wiki/Swift
[04:49:29] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P42062 and previous config saved to /var/cache/conftool/dbconfig/20221201-044929-ladsgroup.json
[04:50:21] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P42063 and previous config saved to /var/cache/conftool/dbconfig/20221201-045020-ladsgroup.json
[04:52:26] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[04:53:35] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 451 bytes in 0.035 second response time https://wikitech.wikimedia.org/wiki/Swift
[05:03:53] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe1011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.061 second response time https://wikitech.wikimedia.org/wiki/Swift
[05:03:54] <wikibugs>	 (PS1) Gerrit maintenance bot: mariadb: Promote db2105 to s3 master [puppet] - https://gerrit.wikimedia.org/r/861874 (https://phabricator.wikimedia.org/T324179)
[05:04:36] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P42064 and previous config saved to /var/cache/conftool/dbconfig/20221201-050435-ladsgroup.json
[05:05:17] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.008 second response time https://wikitech.wikimedia.org/wiki/Swift
[05:05:27] <wikibugs>	 (Abandoned) Ladsgroup: mariadb: Promote db2105 to s3 master [puppet] - https://gerrit.wikimedia.org/r/861874 (https://phabricator.wikimedia.org/T324179) (owner: Gerrit maintenance bot)
[05:05:27] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T318605)', diff saved to https://phabricator.wikimedia.org/P42065 and previous config saved to /var/cache/conftool/dbconfig/20221201-050527-ladsgroup.json
[05:05:29] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2169.codfw.wmnet with reason: Maintenance
[05:05:34] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[05:05:43] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2169.codfw.wmnet with reason: Maintenance
[05:05:49] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2169:3316 (T318605)', diff saved to https://phabricator.wikimedia.org/P42066 and previous config saved to /var/cache/conftool/dbconfig/20221201-050548-ladsgroup.json
[05:06:01] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111 (T323907)', diff saved to https://phabricator.wikimedia.org/P42067 and previous config saved to /var/cache/conftool/dbconfig/20221201-050600-ladsgroup.json
[05:06:08] <stashbot>	 T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907
[05:06:58] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316 (T318605)', diff saved to https://phabricator.wikimedia.org/P42068 and previous config saved to /var/cache/conftool/dbconfig/20221201-050658-ladsgroup.json
[05:08:19] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1100 (T323907)', diff saved to https://phabricator.wikimedia.org/P42069 and previous config saved to /var/cache/conftool/dbconfig/20221201-050818-ladsgroup.json
[05:10:50] <wikibugs>	 (PS1) Gerrit maintenance bot: mariadb: Promote db2105 to s3 master [puppet] - https://gerrit.wikimedia.org/r/861875 (https://phabricator.wikimedia.org/T324180)
[05:15:35] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.256 second response time https://wikitech.wikimedia.org/wiki/Swift
[05:16:41] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 (T318605)', diff saved to https://phabricator.wikimedia.org/P42070 and previous config saved to /var/cache/conftool/dbconfig/20221201-051640-ladsgroup.json
[05:16:49] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[05:18:41] <icinga-wm>	 PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/news (get In the News content) is CRITICAL: Test get In the News content returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds
[05:19:43] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T322618)', diff saved to https://phabricator.wikimedia.org/P42071 and previous config saved to /var/cache/conftool/dbconfig/20221201-051942-ladsgroup.json
[05:19:44] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1186.eqiad.wmnet with reason: Maintenance
[05:19:50] <stashbot>	 T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618
[05:20:09] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1186.eqiad.wmnet with reason: Maintenance
[05:20:15] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1186 (T322618)', diff saved to https://phabricator.wikimedia.org/P42072 and previous config saved to /var/cache/conftool/dbconfig/20221201-052014-ladsgroup.json
[05:20:41] <icinga-wm>	 RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[05:21:07] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111', diff saved to https://phabricator.wikimedia.org/P42073 and previous config saved to /var/cache/conftool/dbconfig/20221201-052107-ladsgroup.json
[05:22:05] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316', diff saved to https://phabricator.wikimedia.org/P42074 and previous config saved to /var/cache/conftool/dbconfig/20221201-052205-ladsgroup.json
[05:22:23] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T322618)', diff saved to https://phabricator.wikimedia.org/P42075 and previous config saved to /var/cache/conftool/dbconfig/20221201-052223-ladsgroup.json
[05:23:25] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1100', diff saved to https://phabricator.wikimedia.org/P42076 and previous config saved to /var/cache/conftool/dbconfig/20221201-052325-ladsgroup.json
[05:24:05] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 451 bytes in 0.021 second response time https://wikitech.wikimedia.org/wiki/Swift
[05:25:25] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T322618)', diff saved to https://phabricator.wikimedia.org/P42077 and previous config saved to /var/cache/conftool/dbconfig/20221201-052524-ladsgroup.json
[05:25:32] <stashbot>	 T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618
[05:31:48] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P42078 and previous config saved to /var/cache/conftool/dbconfig/20221201-053147-ladsgroup.json
[05:34:27] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe1011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.229 second response time https://wikitech.wikimedia.org/wiki/Swift
[05:36:14] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111', diff saved to https://phabricator.wikimedia.org/P42079 and previous config saved to /var/cache/conftool/dbconfig/20221201-053613-ladsgroup.json
[05:37:12] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316', diff saved to https://phabricator.wikimedia.org/P42080 and previous config saved to /var/cache/conftool/dbconfig/20221201-053711-ladsgroup.json
[05:38:32] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1100', diff saved to https://phabricator.wikimedia.org/P42081 and previous config saved to /var/cache/conftool/dbconfig/20221201-053831-ladsgroup.json
[05:46:54] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P42082 and previous config saved to /var/cache/conftool/dbconfig/20221201-054653-ladsgroup.json
[05:49:14] <wikibugs>	 (PS1) KartikMistry: testwiki: Enable Section Translation for 15 Wikipedias [mediawiki-config] - https://gerrit.wikimedia.org/r/862412 (https://phabricator.wikimedia.org/T323825)
[05:51:20] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111 (T323907)', diff saved to https://phabricator.wikimedia.org/P42083 and previous config saved to /var/cache/conftool/dbconfig/20221201-055120-ladsgroup.json
[05:51:22] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2123.codfw.wmnet with reason: Maintenance
[05:51:31] <stashbot>	 T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907
[05:51:36] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2123.codfw.wmnet with reason: Maintenance
[05:51:42] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2123 (T323907)', diff saved to https://phabricator.wikimedia.org/P42084 and previous config saved to /var/cache/conftool/dbconfig/20221201-055142-ladsgroup.json
[05:52:18] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316 (T318605)', diff saved to https://phabricator.wikimedia.org/P42085 and previous config saved to /var/cache/conftool/dbconfig/20221201-055218-ladsgroup.json
[05:52:20] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2171.codfw.wmnet with reason: Maintenance
[05:52:25] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[05:52:34] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2171.codfw.wmnet with reason: Maintenance
[05:52:40] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2171:3316 (T318605)', diff saved to https://phabricator.wikimedia.org/P42086 and previous config saved to /var/cache/conftool/dbconfig/20221201-055239-ladsgroup.json
[05:53:23] <icinga-wm>	 PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/feed/onthisday/{type}/{month}/{day} (retrieve selected events on January 15) is CRITICAL: Test retrieve selected events on January 15 returned the unexpected status 503 (expecting: 200): /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) is CRITICAL: Test retrieve featured image data for April 29, 2016 returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) is CRITICAL: Test retrieve the most read articles for January 1, 2016 returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) is CRITICAL: Test retrieve the most-read articles for January 1, 2016 (with aggregated=true) returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds
[05:53:38] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1100 (T323907)', diff saved to https://phabricator.wikimedia.org/P42087 and previous config saved to /var/cache/conftool/dbconfig/20221201-055337-ladsgroup.json
[05:53:40] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1110.eqiad.wmnet with reason: Maintenance
[05:53:49] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316 (T318605)', diff saved to https://phabricator.wikimedia.org/P42088 and previous config saved to /var/cache/conftool/dbconfig/20221201-055349-ladsgroup.json
[05:53:53] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1110.eqiad.wmnet with reason: Maintenance
[05:54:00] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1110 (T323907)', diff saved to https://phabricator.wikimedia.org/P42089 and previous config saved to /var/cache/conftool/dbconfig/20221201-055359-ladsgroup.json
[05:56:55] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 451 bytes in 0.023 second response time https://wikitech.wikimedia.org/wiki/Swift
[05:57:33] <icinga-wm>	 RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[06:01:15] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on 37 hosts with reason: Primary switchover s1 T323547
[06:01:22] <stashbot>	 T323547: Switchover s1 master (db1163 -> db1118) - https://phabricator.wikimedia.org/T323547
[06:01:40] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 37 hosts with reason: Primary switchover s1 T323547
[06:01:57] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Set db1118 with weight 0 T323547', diff saved to https://phabricator.wikimedia.org/P42090 and previous config saved to /var/cache/conftool/dbconfig/20221201-060157-ladsgroup.json
[06:02:09] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 (T318605)', diff saved to https://phabricator.wikimedia.org/P42091 and previous config saved to /var/cache/conftool/dbconfig/20221201-060206-ladsgroup.json
[06:02:10] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1131.eqiad.wmnet with reason: Maintenance
[06:02:15] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[06:02:24] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1131.eqiad.wmnet with reason: Maintenance
[06:02:30] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1131 (T318605)', diff saved to https://phabricator.wikimedia.org/P42092 and previous config saved to /var/cache/conftool/dbconfig/20221201-060230-ladsgroup.json
[06:03:11] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe1011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.080 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:06:23] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.010 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:08:56] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316', diff saved to https://phabricator.wikimedia.org/P42093 and previous config saved to /var/cache/conftool/dbconfig/20221201-060855-ladsgroup.json
[06:09:03] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 451 bytes in 0.022 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:12:09] <wikibugs>	 SRE, ops-codfw, DBA: db2173 HW errors - https://phabricator.wikimedia.org/T322988 (Marostegui) Thank you Papaul!
[06:12:18] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131 (T318605)', diff saved to https://phabricator.wikimedia.org/P42094 and previous config saved to /var/cache/conftool/dbconfig/20221201-061218-ladsgroup.json
[06:12:26] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[06:16:37] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.123 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:21:45] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to Turnilo for USER:Damilare Adedoyin - https://phabricator.wikimedia.org/T324058 (Marostegui)
[06:22:25] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to Turnilo for USER:Damilare Adedoyin - https://phabricator.wikimedia.org/T324058 (Marostegui)
[06:24:02] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316', diff saved to https://phabricator.wikimedia.org/P42095 and previous config saved to /var/cache/conftool/dbconfig/20221201-062402-ladsgroup.json
[06:27:25] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P42096 and previous config saved to /var/cache/conftool/dbconfig/20221201-062724-ladsgroup.json
[06:27:31] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe1011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.267 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:27:37] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to Turnilo for USER:Damilare Adedoyin - https://phabricator.wikimedia.org/T324058 (Marostegui) Looks like @Damilare is already part of the `analytics-privatedata-users`: T319057
[06:29:30] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for vpoundstone - WMF - https://phabricator.wikimedia.org/T314676 (Marostegui) Open→Resolved I am going to close this, please reopen if adding you to the WMF group wasn't enough.
[06:30:37] <logmsgbot>	 !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/zotero: apply
[06:30:55] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.009 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:30:56] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123 (T323907)', diff saved to https://phabricator.wikimedia.org/P42097 and previous config saved to /var/cache/conftool/dbconfig/20221201-063055-ladsgroup.json
[06:30:57] <logmsgbot>	 !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/zotero: apply
[06:31:00] <stashbot>	 T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907
[06:35:51] <logmsgbot>	 !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/zotero: apply
[06:36:49] <logmsgbot>	 !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/zotero: apply
[06:37:17] <wikibugs>	 (PS2) Ladsgroup: mariadb: Promote db1118 to s1 master [puppet] - https://gerrit.wikimedia.org/r/858382 (https://phabricator.wikimedia.org/T323547) (owner: Gerrit maintenance bot)
[06:37:22] <wikibugs>	 (CR) Ladsgroup: [V: +2 C: +2] mariadb: Promote db1118 to s1 master [puppet] - https://gerrit.wikimedia.org/r/858382 (https://phabricator.wikimedia.org/T323547) (owner: Gerrit maintenance bot)
[06:39:09] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316 (T318605)', diff saved to https://phabricator.wikimedia.org/P42098 and previous config saved to /var/cache/conftool/dbconfig/20221201-063908-ladsgroup.json
[06:39:11] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2180.codfw.wmnet with reason: Maintenance
[06:39:12] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[06:39:24] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2180.codfw.wmnet with reason: Maintenance
[06:39:31] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2180 (T318605)', diff saved to https://phabricator.wikimedia.org/P42099 and previous config saved to /var/cache/conftool/dbconfig/20221201-063930-ladsgroup.json
[06:40:43] <logmsgbot>	 !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/zotero: apply
[06:41:13] <jinxer-wm>	 (KubernetesAPILatency) firing: (13) High Kubernetes API latency (LIST certificates) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[06:41:17] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.253 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:41:21] <logmsgbot>	 !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/zotero: apply
[06:41:37] <icinga-wm>	 PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:41:40] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T318605)', diff saved to https://phabricator.wikimedia.org/P42100 and previous config saved to /var/cache/conftool/dbconfig/20221201-064140-ladsgroup.json
[06:42:02] <logmsgbot>	 !log oblivian@deploy1002 helmfile [staging] START helmfile.d/services/zotero: apply
[06:42:07] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics-privatedata-users for Dasm - https://phabricator.wikimedia.org/T322591 (Marostegui) @andrea.denisse - can you merge and submit your patch so we can create the kerberos principal and close this task? Thanks!
[06:42:09] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics-privatedata-users for Dasm - https://phabricator.wikimedia.org/T322591 (Marostegui) a:Jcross→andrea.denisse
[06:42:19] <logmsgbot>	 !log oblivian@deploy1002 helmfile [staging] DONE helmfile.d/services/zotero: apply
[06:42:31] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P42101 and previous config saved to /var/cache/conftool/dbconfig/20221201-064230-ladsgroup.json
[06:43:41] <icinga-wm>	 RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:44:52] <wikibugs>	 (Abandoned) Giuseppe Lavagetto: WIP: Add cassandra-table-properties tool to Cassandra deployments [puppet] - https://gerrit.wikimedia.org/r/529074 (https://phabricator.wikimedia.org/T226553) (owner: Holger Knust)
[06:45:19] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] function-evaluator: convert to modules [deployment-charts] - https://gerrit.wikimedia.org/r/860829 (owner: Giuseppe Lavagetto)
[06:45:30] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] function-orchestrator: convert to modules [deployment-charts] - https://gerrit.wikimedia.org/r/860830 (owner: Giuseppe Lavagetto)
[06:46:02] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123', diff saved to https://phabricator.wikimedia.org/P42102 and previous config saved to /var/cache/conftool/dbconfig/20221201-064602-ladsgroup.json
[06:50:13] <wikibugs>	 (Merged) jenkins-bot: function-evaluator: convert to modules [deployment-charts] - https://gerrit.wikimedia.org/r/860829 (owner: Giuseppe Lavagetto)
[06:50:25] <wikibugs>	 (Merged) jenkins-bot: function-orchestrator: convert to modules [deployment-charts] - https://gerrit.wikimedia.org/r/860830 (owner: Giuseppe Lavagetto)
[06:51:27] <wikibugs>	 (PS1) Marostegui: install_server: Do not reimage db1206 [puppet] - https://gerrit.wikimedia.org/r/862643
[06:56:24] <wikibugs>	 (CR) Marostegui: [C: +2] install_server: Do not reimage db1206 [puppet] - https://gerrit.wikimedia.org/r/862643 (owner: Marostegui)
[06:56:47] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P42103 and previous config saved to /var/cache/conftool/dbconfig/20221201-065646-ladsgroup.json
[06:57:38] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131 (T318605)', diff saved to https://phabricator.wikimedia.org/P42104 and previous config saved to /var/cache/conftool/dbconfig/20221201-065737-ladsgroup.json
[06:57:39] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1140.eqiad.wmnet with reason: Maintenance
[06:57:40] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[06:57:41] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1140.eqiad.wmnet with reason: Maintenance
[07:00:04] <jouncebot>	 kormat, marostegui, and Amir1: Your horoscope predicts another unfortunate Primary database switchover deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221201T0700).
[07:00:37] <Amir1>	 let's get the party started
[07:01:09] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123', diff saved to https://phabricator.wikimedia.org/P42105 and previous config saved to /var/cache/conftool/dbconfig/20221201-070108-ladsgroup.json
[07:01:21] <Amir1>	 !log Starting s1 eqiad failover from db1163 to db1118 - T323547
[07:01:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:01:24] <stashbot>	 T323547: Switchover s1 master (db1163 -> db1118) - https://phabricator.wikimedia.org/T323547
[07:01:32] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Set s1 eqiad as read-only for maintenance - T323547', diff saved to https://phabricator.wikimedia.org/P42106 and previous config saved to /var/cache/conftool/dbconfig/20221201-070131-ladsgroup.json
[07:01:48] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki-dev: convert to modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/860713 (owner: 10Giuseppe Lavagetto)
[07:02:03] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Promote db1118 to s1 primary and set section read-write T323547', diff saved to https://phabricator.wikimedia.org/P42107 and previous config saved to /var/cache/conftool/dbconfig/20221201-070203-ladsgroup.json
[07:03:18] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] tegola-vector-tiles: convert to modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/860707 (owner: 10Giuseppe Lavagetto)
[07:05:23] <wikibugs>	 (03PS2) 10Ladsgroup: wmnet: Update s1-master alias [dns] - 10https://gerrit.wikimedia.org/r/858383 (https://phabricator.wikimedia.org/T323547) (owner: 10Gerrit maintenance bot)
[07:06:31] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 451 bytes in 0.022 second response time https://wikitech.wikimedia.org/wiki/Swift
[07:07:09] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+2] wmnet: Update s1-master alias [dns] - 10https://gerrit.wikimedia.org/r/858383 (https://phabricator.wikimedia.org/T323547) (owner: 10Gerrit maintenance bot)
[07:07:30] <wikibugs>	 (03Merged) 10jenkins-bot: mediawiki-dev: convert to modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/860713 (owner: 10Giuseppe Lavagetto)
[07:07:59] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depool db1163 T323547', diff saved to https://phabricator.wikimedia.org/P42108 and previous config saved to /var/cache/conftool/dbconfig/20221201-070758-ladsgroup.json
[07:08:02] <stashbot>	 T323547: Switchover s1 master (db1163 -> db1118) - https://phabricator.wikimedia.org/T323547
[07:08:07] <wikibugs>	 (03Merged) 10jenkins-bot: tegola-vector-tiles: convert to modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/860707 (owner: 10Giuseppe Lavagetto)
[07:09:48] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1163.eqiad.wmnet with reason: Maintenance
[07:09:50] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1163.eqiad.wmnet with reason: Maintenance
[07:11:54] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P42109 and previous config saved to /var/cache/conftool/dbconfig/20221201-071153-ladsgroup.json
[07:12:26] <logmsgbot>	 !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/tegola-vector-tiles: apply
[07:12:49] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe1011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.113 second response time https://wikitech.wikimedia.org/wiki/Swift
[07:12:54] <logmsgbot>	 !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/tegola-vector-tiles: apply
[07:13:18] <logmsgbot>	 !log oblivian@deploy1002 helmfile [staging] START helmfile.d/services/tegola-vector-tiles: apply
[07:13:31] <logmsgbot>	 !log oblivian@deploy1002 helmfile [staging] DONE helmfile.d/services/tegola-vector-tiles: apply
[07:13:44] <logmsgbot>	 !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/tegola-vector-tiles: apply
[07:14:07] <logmsgbot>	 !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/tegola-vector-tiles: apply
[07:16:15] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123 (T323907)', diff saved to https://phabricator.wikimedia.org/P42110 and previous config saved to /var/cache/conftool/dbconfig/20221201-071615-ladsgroup.json
[07:16:17] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2128.codfw.wmnet with reason: Maintenance
[07:16:19] <stashbot>	 T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907
[07:16:31] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2128.codfw.wmnet with reason: Maintenance
[07:16:32] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2094.codfw.wmnet with reason: Maintenance
[07:16:35] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2094.codfw.wmnet with reason: Maintenance
[07:16:41] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2128 (T323907)', diff saved to https://phabricator.wikimedia.org/P42111 and previous config saved to /var/cache/conftool/dbconfig/20221201-071641-ladsgroup.json
[07:16:51] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 451 bytes in 0.023 second response time https://wikitech.wikimedia.org/wiki/Swift
[07:18:22] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1163.eqiad.wmnet with reason: Maintenance
[07:18:24] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1163.eqiad.wmnet with reason: Maintenance
[07:19:25] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1163.eqiad.wmnet with reason: Maintenance
[07:19:27] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1163.eqiad.wmnet with reason: Maintenance
[07:19:35] <icinga-wm>	 PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) is CRITICAL: Test retrieve the most read articles for January 1, 2016 returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds
[07:20:37] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1163.eqiad.wmnet with reason: Maintenance
[07:20:39] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1163.eqiad.wmnet with reason: Maintenance
[07:21:03] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] flink-session-cluster: convert to modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/860828 (owner: 10Giuseppe Lavagetto)
[07:21:39] <icinga-wm>	 RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[07:22:27] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.011 second response time https://wikitech.wikimedia.org/wiki/Swift
[07:22:46] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: "Please when you merge  this change, remember to go to the puppet private repository and change the "tls:" stanzas for eventgate to "mesh:"" [deployment-charts] - 10https://gerrit.wikimedia.org/r/860518 (owner: 10Giuseppe Lavagetto)
[07:23:13] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe1011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[07:26:05] <wikibugs>	 (03Merged) 10jenkins-bot: flink-session-cluster: convert to modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/860828 (owner: 10Giuseppe Lavagetto)
[07:27:00] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T318605)', diff saved to https://phabricator.wikimedia.org/P42113 and previous config saved to /var/cache/conftool/dbconfig/20221201-072659-ladsgroup.json
[07:27:04] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[07:29:14] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T323907)', diff saved to https://phabricator.wikimedia.org/P42114 and previous config saved to /var/cache/conftool/dbconfig/20221201-072914-ladsgroup.json
[07:29:18] <stashbot>	 T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907
[07:29:39] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1165.eqiad.wmnet with reason: Maintenance
[07:29:52] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1165.eqiad.wmnet with reason: Maintenance
[07:29:54] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[07:30:09] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[07:30:13] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.302 second response time https://wikitech.wikimedia.org/wiki/Swift
[07:30:16] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1165 (T318605)', diff saved to https://phabricator.wikimedia.org/P42115 and previous config saved to /var/cache/conftool/dbconfig/20221201-073015-ladsgroup.json
[07:35:59] <jinxer-wm>	 (KubernetesAPILatency) firing: (14) High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:36:34] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T318605)', diff saved to https://phabricator.wikimedia.org/P42116 and previous config saved to /var/cache/conftool/dbconfig/20221201-073634-ladsgroup.json
[07:36:38] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[07:37:27] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.008 second response time https://wikitech.wikimedia.org/wiki/Swift
[07:41:13] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.095 second response time https://wikitech.wikimedia.org/wiki/Swift
[07:43:01] <wikibugs>	 (03CR) 10Muehlenhoff: scap: move firewall rules out of the module (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/862378 (https://phabricator.wikimedia.org/T114209) (owner: 10Dzahn)
[07:44:21] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P42117 and previous config saved to /var/cache/conftool/dbconfig/20221201-074420-ladsgroup.json
[07:49:15] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] sites.yaml: remove dns5001 from anycast_neighbors [homer/public] - 10https://gerrit.wikimedia.org/r/862321 (https://phabricator.wikimedia.org/T323830) (owner: 10Ssingh)
[07:49:27] <wikibugs>	 10Puppet, 10SRE, 10Infrastructure-Foundations, 10serviceops-radar: Fix UIDs for deployment server users - https://phabricator.wikimedia.org/T163667 (10MoritzMuehlenhoff) Whether the user is created via adduser or systemd::sysuser doesn't matter, the fix is to have a reserved UID defined via data.yaml in th...
[07:51:41] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 400474
[07:51:41] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P42118 and previous config saved to /var/cache/conftool/dbconfig/20221201-075140-ladsgroup.json
[07:52:05] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 400474
[07:55:07] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314 (T318605)', diff saved to https://phabricator.wikimedia.org/P42119 and previous config saved to /var/cache/conftool/dbconfig/20221201-075506-ladsgroup.json
[07:55:10] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[07:56:07] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128 (T323907)', diff saved to https://phabricator.wikimedia.org/P42120 and previous config saved to /var/cache/conftool/dbconfig/20221201-075606-ladsgroup.json
[07:56:09] <stashbot>	 T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907
[07:59:28] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P42122 and previous config saved to /var/cache/conftool/dbconfig/20221201-075927-ladsgroup.json
[08:00:05] <jouncebot>	 Amir1, apergos, and jnuche: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC morning backport and config training deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221201T0800).
[08:00:14] <apergos>	 morning! once again there are no trainees signed up today for training and no patches scheduled for deployment during the window. so we'll see you next time!
[08:01:07] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.007 second response time https://wikitech.wikimedia.org/wiki/Swift
[08:05:18] <wikibugs>	 (03CR) 10Guergana Tzatchkova: Add Property (120) to Wikidata content Namespace (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/862247 (https://phabricator.wikimedia.org/T321282) (owner: 10Guergana Tzatchkova)
[08:05:23] <wikibugs>	 (03PS2) 10Guergana Tzatchkova: Add Property (120) to Wikidata content Namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/862247 (https://phabricator.wikimedia.org/T321282)
[08:06:08] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Add Property (120) to Wikidata content Namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/862247 (https://phabricator.wikimedia.org/T321282) (owner: 10Guergana Tzatchkova)
[08:06:48] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P42123 and previous config saved to /var/cache/conftool/dbconfig/20221201-080647-ladsgroup.json
[08:07:27] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[08:10:14] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314', diff saved to https://phabricator.wikimedia.org/P42124 and previous config saved to /var/cache/conftool/dbconfig/20221201-081013-ladsgroup.json
[08:11:13] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128', diff saved to https://phabricator.wikimedia.org/P42125 and previous config saved to /var/cache/conftool/dbconfig/20221201-081112-ladsgroup.json
[08:11:57] <wikibugs>	 (03PS3) 10Guergana Tzatchkova: Add Property (120) to Wikidata content Namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/862247 (https://phabricator.wikimedia.org/T321282)
[08:12:39] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Add Property (120) to Wikidata content Namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/862247 (https://phabricator.wikimedia.org/T321282) (owner: 10Guergana Tzatchkova)
[08:14:34] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T323907)', diff saved to https://phabricator.wikimedia.org/P42126 and previous config saved to /var/cache/conftool/dbconfig/20221201-081433-ladsgroup.json
[08:14:36] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1113.eqiad.wmnet with reason: Maintenance
[08:14:38] <stashbot>	 T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907
[08:14:38] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1113.eqiad.wmnet with reason: Maintenance
[08:14:45] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3315 (T323907)', diff saved to https://phabricator.wikimedia.org/P42127 and previous config saved to /var/cache/conftool/dbconfig/20221201-081444-ladsgroup.json
[08:15:31] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.009 second response time https://wikitech.wikimedia.org/wiki/Swift
[08:15:59] <wikibugs>	 (03PS4) 10Guergana Tzatchkova: Add Property (120) to Wikidata content Namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/862247 (https://phabricator.wikimedia.org/T321282)
[08:20:59] <jinxer-wm>	 (KubernetesAPILatency) firing: (14) High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:21:51] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.254 second response time https://wikitech.wikimedia.org/wiki/Swift
[08:21:54] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T318605)', diff saved to https://phabricator.wikimedia.org/P42128 and previous config saved to /var/cache/conftool/dbconfig/20221201-082154-ladsgroup.json
[08:21:56] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1168.eqiad.wmnet with reason: Maintenance
[08:21:58] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[08:22:09] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1168.eqiad.wmnet with reason: Maintenance
[08:22:16] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1168 (T318605)', diff saved to https://phabricator.wikimedia.org/P42129 and previous config saved to /var/cache/conftool/dbconfig/20221201-082215-ladsgroup.json
[08:23:23] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: eventgate: convert to modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/860518
[08:23:25] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: calculator-service: convert to modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/862829
[08:24:47] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] calculator-service: convert to modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/862829 (owner: 10Giuseppe Lavagetto)
[08:25:20] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314', diff saved to https://phabricator.wikimedia.org/P42130 and previous config saved to /var/cache/conftool/dbconfig/20221201-082519-ladsgroup.json
[08:26:19] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128', diff saved to https://phabricator.wikimedia.org/P42131 and previous config saved to /var/cache/conftool/dbconfig/20221201-082619-ladsgroup.json
[08:27:09] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 451 bytes in 0.021 second response time https://wikitech.wikimedia.org/wiki/Swift
[08:30:59] <jinxer-wm>	 (KubernetesAPILatency) firing: (14) High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:33:21] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe1011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.095 second response time https://wikitech.wikimedia.org/wiki/Swift
[08:36:43] <_joe_>	 not sure how we have a backend on ms-fe
[08:39:15] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T318605)', diff saved to https://phabricator.wikimedia.org/P42134 and previous config saved to /var/cache/conftool/dbconfig/20221201-083914-ladsgroup.json
[08:39:18] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[08:40:26] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314 (T318605)', diff saved to https://phabricator.wikimedia.org/P42135 and previous config saved to /var/cache/conftool/dbconfig/20221201-084026-ladsgroup.json
[08:40:45] <volans>	 _joe_: check_https_url!ms-fe.svc.eqiad.wmnet!/monitoring/backend
[08:41:07] <_joe_>	 volans: yeah saw on icinga's interface
[08:41:26] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128 (T323907)', diff saved to https://phabricator.wikimedia.org/P42136 and previous config saved to /var/cache/conftool/dbconfig/20221201-084125-ladsgroup.json
[08:41:28] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2137.codfw.wmnet with reason: Maintenance
[08:41:29] <stashbot>	 T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907
[08:41:34] <volans>	 now what that does check...
[08:41:41] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2137.codfw.wmnet with reason: Maintenance
[08:41:47] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2137:3315 (T323907)', diff saved to https://phabricator.wikimedia.org/P42137 and previous config saved to /var/cache/conftool/dbconfig/20221201-084147-ladsgroup.json
[08:42:15] <icinga-wm>	 PROBLEM - Host mw1334 is DOWN: PING CRITICAL - Packet loss = 100%
[08:42:45] <volans>	 uh
[08:43:03] <kostajh>	 apergos: I might backport something to wmf.12 if that's still possible
[08:43:17] <icinga-wm>	 RECOVERY - Host mw1334 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms
[08:43:17] <volans>	 I'm checking the mgmt
[08:44:13] <volans>	 _joe_: mw1334 just got rebooted
[08:44:43] <volans>	 nothing in syslog, I'll check hw logs
[08:44:55] <_joe_>	 uh
[08:46:03] <volans>	 and ofc... ssh to mgmt doesn't work
[08:46:16] <volans>	 troubleshooting
[08:47:48] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[08:48:31] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[08:48:32] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[08:49:00] <volans>	 !log restart idrac on mw1334, ipmi and remote ipmi works fine, ssh not responding
[08:49:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:49:15] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[08:49:33] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2013.codfw.wmnet
[08:50:27] <icinga-wm>	 RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:50:51] <wikibugs>	 (03PS1) 10Kosta Harlan: User impact: Fix per-page pageview numbers [extensions/GrowthExperiments] (wmf/1.40.0-wmf.12) - 10https://gerrit.wikimedia.org/r/862354 (https://phabricator.wikimedia.org/T323253)
[08:51:23] <kostajh>	 _joe_ / volans is it OK if I backport something to wmf.12 now, or would that interfere with what you are troublehsooting?
[08:51:28] <kostajh>	 *troubleshooting, even
[08:51:44] <_joe_>	 no, go on please
[08:51:47] <volans>	 kostajh: go on
[08:52:01] <kostajh>	 thx
[08:52:26] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[08:53:38] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kharlan@deploy1002 using scap backport" [extensions/GrowthExperiments] (wmf/1.40.0-wmf.12) - 10https://gerrit.wikimedia.org/r/862354 (https://phabricator.wikimedia.org/T323253) (owner: 10Kosta Harlan)
[08:54:21] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.010 second response time https://wikitech.wikimedia.org/wiki/Swift
[08:54:22] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P42138 and previous config saved to /var/cache/conftool/dbconfig/20221201-085421-ladsgroup.json
[08:55:59] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 451 bytes in 0.032 second response time https://wikitech.wikimedia.org/wiki/Swift
[08:56:36] <icinga-wm>	 ACKNOWLEDGEMENT - MD RAID on ganeti2013 is CRITICAL: CRITICAL: State: degraded, Active: 9, Working: 9, Failed: 0, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T324185 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[08:56:41] <wikibugs>	 10SRE, 10ops-codfw: Degraded RAID on ganeti2013 - https://phabricator.wikimedia.org/T324185 (10ops-monitoring-bot)
[09:00:43] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.140 second response time https://wikitech.wikimedia.org/wiki/Swift
[09:01:08] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host ganeti2013.codfw.wmnet
[09:02:21] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe1011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.098 second response time https://wikitech.wikimedia.org/wiki/Swift
[09:02:39] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.009 second response time https://wikitech.wikimedia.org/wiki/Swift
[09:04:45] <volans>	 Emperor: is this "normal" ^^^
[09:04:50] <volans>	 has been flapping for a bit
[09:07:41] <moritzm>	 !log rebuilding raid on ganeti2013 T323222
[09:07:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:07:45] <stashbot>	 T323222: Degraded RAID on ganeti2013 - https://phabricator.wikimedia.org/T323222
[09:08:21] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 451 bytes in 0.021 second response time https://wikitech.wikimedia.org/wiki/Swift
[09:08:27] <Emperor>	 volans: no :-/
[09:08:58] <volans>	 there is no mention of the backend monitoring on wikitech
[09:09:08] <volans>	 so not sure how to debug further without starting to read code
[09:09:18] <wikibugs>	 (03Merged) 10jenkins-bot: User impact: Fix per-page pageview numbers [extensions/GrowthExperiments] (wmf/1.40.0-wmf.12) - 10https://gerrit.wikimedia.org/r/862354 (https://phabricator.wikimedia.org/T323253) (owner: 10Kosta Harlan)
[09:09:28] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P42139 and previous config saved to /var/cache/conftool/dbconfig/20221201-090927-ladsgroup.json
[09:09:52] <Emperor>	 ms-fe1011 looks only lightly loaded to me currently
[09:09:52] <logmsgbot>	 !log kharlan@deploy1002 Started scap: Backport for [[gerrit:862354|User impact: Fix per-page pageview numbers (T323253)]]
[09:09:55] <stashbot>	 T323253: NewImpact module: Page view data should be limited to when user made their first edit - https://phabricator.wikimedia.org/T323253
[09:10:21] <wikibugs>	 (03PS1) 10Volans: setup.py: add temporary upper limit for pylint [cookbooks] - 10https://gerrit.wikimedia.org/r/862830
[09:10:27] <Emperor>	 but there are a bunch of errors in server.log
[09:10:43] <wikibugs>	 (03CR) 10David Caro: [C: 03+1] setup.py: add temporary upper limit for pylint [cookbooks] - 10https://gerrit.wikimedia.org/r/862830 (owner: 10Volans)
[09:11:01] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[09:11:08] <logmsgbot>	 !log kharlan@deploy1002 kharlan and kharlan: Backport for [[gerrit:862354|User impact: Fix per-page pageview numbers (T323253)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet
[09:11:36] <volans>	 Emperor: let us (oncallers) know if we can be of any assistance
[09:11:54] <Emperor>	 it's not a behaviour I've seen before
[09:12:58] <wikibugs>	 (03PS18) 10Slyngshede: C:ldap::client::utils Rewrite add-ldap-group [puppet] - 10https://gerrit.wikimedia.org/r/860568
[09:13:18] <wikibugs>	 (03CR) 10Volans: [C: 03+2] setup.py: add temporary upper limit for pylint [cookbooks] - 10https://gerrit.wikimedia.org/r/862830 (owner: 10Volans)
[09:13:32] <Emperor>	 and the backtraces on ms-fe1011 are largely from inside python libraries rather than swift (IYSWIM)
[09:14:30] <Emperor>	 seem to have started around 15:55:11 yesterday
[09:14:34] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[09:14:41] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe1011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[09:15:08] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38543/console" [puppet] - 10https://gerrit.wikimedia.org/r/860568 (owner: 10Slyngshede)
[09:15:09] <Emperor>	 !log depool, restart, repool swift-proxy on ms-fe1011
[09:15:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:15:19] <wikibugs>	 (03Merged) 10jenkins-bot: setup.py: add temporary upper limit for pylint [cookbooks] - 10https://gerrit.wikimedia.org/r/862830 (owner: 10Volans)
[09:15:51] <icinga-wm>	 RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:16:19] <volans>	 Emperor: as unrelated as it might seem, the only thing that happened at that time was the shutdown of thumbor2004 for idrac maintenance
[09:16:35] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 451 bytes in 0.029 second response time https://wikitech.wikimedia.org/wiki/Swift
[09:16:56] <Emperor>	 volans: that seems entirely plausibly related, is it still off? swift talks to thumbor so I could well believe thumbor being unhappy would be enough to make swift unhappy
[09:17:03] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.012 second response time https://wikitech.wikimedia.org/wiki/Swift
[09:17:15] <volans>	 Emperor: but cross-dc?
[09:17:36] <volans>	 thumbor2004 is up since 16h17m
[09:18:05] <wikibugs>	 (03PS19) 10Slyngshede: C:ldap::client::utils Rewrite add-ldap-group [puppet] - 10https://gerrit.wikimedia.org/r/860568
[09:18:24] <logmsgbot>	 !log kharlan@deploy1002 Finished scap: Backport for [[gerrit:862354|User impact: Fix per-page pageview numbers (T323253)]] (duration: 08m 31s)
[09:18:27] <stashbot>	 T323253: NewImpact module: Page view data should be limited to when user made their first edit - https://phabricator.wikimedia.org/T323253
[09:19:11] <kostajh>	 !log UTC morning deploys done
[09:19:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:19:33] <Emperor>	 volans: probably coincidence, then. ms-fe1011 has swift-proxy and nginx restarted and repooled, let's see how it goes
[09:19:43] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38544/console" [puppet] - 10https://gerrit.wikimedia.org/r/860568 (owner: 10Slyngshede)
[09:19:56] <volans>	 ack
[09:20:40] <wikibugs>	 (03CR) 10Slyngshede: C:ldap::client::utils Rewrite add-ldap-group (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/860568 (owner: 10Slyngshede)
[09:21:21] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[09:21:22] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[09:22:05] <icinga-wm>	 PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:23:31] <Emperor>	 volans: seems to be behaving itself now (no more backtraces in server.log)
[09:24:13] <volans>	 good! thanks
[09:24:21] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/860568 (owner: 10Slyngshede)
[09:24:35] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T318605)', diff saved to https://phabricator.wikimedia.org/P42140 and previous config saved to /var/cache/conftool/dbconfig/20221201-092434-ladsgroup.json
[09:24:36] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1180.eqiad.wmnet with reason: Maintenance
[09:24:38] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[09:24:49] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1180.eqiad.wmnet with reason: Maintenance
[09:24:56] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1180 (T318605)', diff saved to https://phabricator.wikimedia.org/P42141 and previous config saved to /var/cache/conftool/dbconfig/20221201-092455-ladsgroup.json
[09:27:56] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[09:29:29] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Enhance account handling (meta bug) - https://phabricator.wikimedia.org/T142815 (10MoritzMuehlenhoff)
[09:29:46] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Release-Engineering-Team: Enforce reference to Phabricator task for all commits to modules/admin/data/data.yaml - https://phabricator.wikimedia.org/T142827 (10MoritzMuehlenhoff) 05Open→03Declined Mid-term data.yaml will be generated via the IDM which will include pr...
[09:30:25] <wikibugs>	 (03PS1) 10Slyngshede: ldap:client:utils remove outdated ldaplist util. [puppet] - 10https://gerrit.wikimedia.org/r/862833 (https://phabricator.wikimedia.org/T114063)
[09:32:14] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T318605)', diff saved to https://phabricator.wikimedia.org/P42142 and previous config saved to /var/cache/conftool/dbconfig/20221201-093214-ladsgroup.json
[09:32:17] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[09:34:44] <wikibugs>	 (03PS2) 10Slyngshede: ldap:client:utils remove outdated ldaplist util. [puppet] - 10https://gerrit.wikimedia.org/r/862833 (https://phabricator.wikimedia.org/T114063)
[09:40:25] <wikibugs>	 (03PS7) 10David Caro: wmcs.create_instance_with_prefix: Add a sec group default [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/841089
[09:41:00] <wikibugs>	 (03PS7) 10David Caro: cookbooks: print out instructions on next step after updating the buildpack/tekton images in the local repository [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/859582 (https://phabricator.wikimedia.org/T321188) (owner: 10Raymond Ndibe)
[09:41:40] <wikibugs>	 (03CR) 10David Caro: "It was due to https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/862830" [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/859582 (https://phabricator.wikimedia.org/T321188) (owner: 10Raymond Ndibe)
[09:43:55] <wikibugs>	 (03PS1) 10Kosta Harlan: DatabaseUserImpactStore: Fix parameter style for upsert keys [extensions/GrowthExperiments] (wmf/1.40.0-wmf.12) - 10https://gerrit.wikimedia.org/r/862355 (https://phabricator.wikimedia.org/T324188)
[09:47:21] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P42143 and previous config saved to /var/cache/conftool/dbconfig/20221201-094720-ladsgroup.json
[09:49:07] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T323907)', diff saved to https://phabricator.wikimedia.org/P42144 and previous config saved to /var/cache/conftool/dbconfig/20221201-094907-ladsgroup.json
[09:49:10] <stashbot>	 T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907
[09:49:43] <wikibugs>	 (03CR) 10David Caro: [C: 03+2] wmcs.create_instance_with_prefix: Add a sec group default [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/841089 (owner: 10David Caro)
[09:49:57] <wikibugs>	 (03CR) 10David Caro: [C: 03+2] wmcs.create_instance_with_prefix: Add a sec group default (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/841089 (owner: 10David Caro)
[09:50:28] <wikibugs>	 (03PS1) 10Filippo Giunchedi: hiera: depool graphite1004 for reads [puppet] - 10https://gerrit.wikimedia.org/r/862838 (https://phabricator.wikimedia.org/T324089)
[09:51:07] <wikibugs>	 10SRE, 10ops-codfw: Degraded RAID on ganeti2013 - https://phabricator.wikimedia.org/T324185 (10Kizule)
[09:51:10] <wikibugs>	 10SRE, 10ops-codfw: Degraded RAID on ganeti2013 - https://phabricator.wikimedia.org/T323222 (10Kizule)
[09:51:38] <wikibugs>	 (03PS2) 10Filippo Giunchedi: decom graphite1004 [puppet] - 10https://gerrit.wikimedia.org/r/862226 (https://phabricator.wikimedia.org/T324089)
[09:52:23] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Good riddance!" [puppet] - 10https://gerrit.wikimedia.org/r/862833 (https://phabricator.wikimedia.org/T114063) (owner: 10Slyngshede)
[09:52:38] <wikibugs>	 (03PS7) 10Kosta Harlan: GrowthExperiments: End imagerecommendation experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/859991 (https://phabricator.wikimedia.org/T323686)
[09:52:40] <wikibugs>	 (03PS6) 10Kosta Harlan: GrowthExperiments: Start oldimpact experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/860867 (https://phabricator.wikimedia.org/T323526)
[09:52:42] <wikibugs>	 (03PS1) 10Kosta Harlan: GrowthExperiments: Enable new impact module on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/862839 (https://phabricator.wikimedia.org/T323526)
[09:52:44] <wikibugs>	 (03PS1) 10Kosta Harlan: GrowthExperiments: Enable new impact module on pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/862840 (https://phabricator.wikimedia.org/T323686)
[09:53:03] <wikibugs>	 (03CR) 10Kosta Harlan: GrowthExperiments: Start oldimpact experiment (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/860867 (https://phabricator.wikimedia.org/T323526) (owner: 10Kosta Harlan)
[09:53:39] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Set role_contacts for role analytics_cluster::coordinator::replica [puppet] - 10https://gerrit.wikimedia.org/r/860608 (owner: 10Muehlenhoff)
[09:53:46] <wikibugs>	 (03Merged) 10jenkins-bot: wmcs.create_instance_with_prefix: Add a sec group default [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/841089 (owner: 10David Caro)
[09:54:10] <wikibugs>	 (03CR) 10David Caro: [C: 03+2] harbor: remove support for <bullseye [puppet] - 10https://gerrit.wikimedia.org/r/860623 (https://phabricator.wikimedia.org/T267616) (owner: 10David Caro)
[09:54:13] <wikibugs>	 (03CR) 10David Caro: [C: 03+2] harbor: remove unused harbor::db module/role [puppet] - 10https://gerrit.wikimedia.org/r/860627 (https://phabricator.wikimedia.org/T267616) (owner: 10David Caro)
[09:54:15] <wikibugs>	 (03CR) 10David Caro: [C: 03+2] toolforge harbor: update certs with acmechief [puppet] - 10https://gerrit.wikimedia.org/r/728629 (https://phabricator.wikimedia.org/T267616) (owner: 10Bstorm)
[09:54:18] <wikibugs>	 (03CR) 10David Caro: [C: 03+2] harbor: ensure that it's started [puppet] - 10https://gerrit.wikimedia.org/r/860896 (https://phabricator.wikimedia.org/T267616) (owner: 10David Caro)
[09:56:09] <icinga-wm>	 RECOVERY - DPKG on grafana1002 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[09:56:44] <wikibugs>	 (03CR) 10Sergio Gimeno: [C: 03+1] [no-op] GrowthExperiments: Enable D3 in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/861506 (https://phabricator.wikimedia.org/T318854) (owner: 10Gergő Tisza)
[09:57:00] <wikibugs>	 (03CR) 10Sergio Gimeno: [C: 03+1] GrowthExperiments: Enable new impact module on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/862839 (https://phabricator.wikimedia.org/T323526) (owner: 10Kosta Harlan)
[09:57:20] <wikibugs>	 (03PS1) 10Muehlenhoff: Make ganeti5004 a Ganeti node [puppet] - 10https://gerrit.wikimedia.org/r/862841
[09:58:21] <wikibugs>	 (03CR) 10Kosta Harlan: [C: 04-2] GrowthExperiments: Run refreshUserImpactData maintenance script in production (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/859568 (https://phabricator.wikimedia.org/T322541) (owner: 10Kosta Harlan)
[09:58:39] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Make ganeti5004 a Ganeti node [puppet] - 10https://gerrit.wikimedia.org/r/862841 (owner: 10Muehlenhoff)
[10:00:26] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] "Looks good to me. Feel free to merge at will." [puppet] - 10https://gerrit.wikimedia.org/r/861368 (https://phabricator.wikimedia.org/T323783) (owner: 10Stevemunene)
[10:01:19] <wikibugs>	 (03PS2) 10David Caro: quota_increase: Fix issue with dashed quota names [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/862294
[10:01:39] <wikibugs>	 (03PS2) 10David Caro: wmcs: add cookbook to create a project [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/862229 (https://phabricator.wikimedia.org/T323952)
[10:02:27] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P42145 and previous config saved to /var/cache/conftool/dbconfig/20221201-100227-ladsgroup.json
[10:04:14] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P42146 and previous config saved to /var/cache/conftool/dbconfig/20221201-100413-ladsgroup.json
[10:04:25] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] quota_increase: Fix issue with dashed quota names [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/862294 (owner: 10David Caro)
[10:04:55] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] wmcs: add cookbook to create a project [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/862229 (https://phabricator.wikimedia.org/T323952) (owner: 10David Caro)
[10:09:03] <wikibugs>	 (03PS3) 10Filippo Giunchedi: prometheus: move traffic rules to 'ops' instance [puppet] - 10https://gerrit.wikimedia.org/r/861825 (https://phabricator.wikimedia.org/T288196)
[10:09:09] <wikibugs>	 (03PS3) 10David Caro: wmcs: add cookbook to create a project [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/862229 (https://phabricator.wikimedia.org/T323952)
[10:10:22] <wikibugs>	 (03PS6) 10Elukey: WIP - Upgrade knative to 1.7.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/861395 (https://phabricator.wikimedia.org/T323793)
[10:11:44] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, there's a few more places where 1004 is referenced (such as service::catalog in Hiera), but those can happen in subsequent com" [puppet] - 10https://gerrit.wikimedia.org/r/862838 (https://phabricator.wikimedia.org/T324089) (owner: 10Filippo Giunchedi)
[10:12:17] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: calculator-service: convert to modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/862829
[10:12:19] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: Remove common_templates [deployment-charts] - 10https://gerrit.wikimedia.org/r/862842
[10:12:33] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] wmcs: add cookbook to create a project [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/862229 (https://phabricator.wikimedia.org/T323952) (owner: 10David Caro)
[10:12:48] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Remove common_templates [deployment-charts] - 10https://gerrit.wikimedia.org/r/862842 (owner: 10Giuseppe Lavagetto)
[10:13:57] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315 (T323907)', diff saved to https://phabricator.wikimedia.org/P42147 and previous config saved to /var/cache/conftool/dbconfig/20221201-101356-ladsgroup.json
[10:14:00] <stashbot>	 T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907
[10:16:03] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "Thank you for the quick review!" [puppet] - 10https://gerrit.wikimedia.org/r/862838 (https://phabricator.wikimedia.org/T324089) (owner: 10Filippo Giunchedi)
[10:16:05] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] hiera: depool graphite1004 for reads [puppet] - 10https://gerrit.wikimedia.org/r/862838 (https://phabricator.wikimedia.org/T324089) (owner: 10Filippo Giunchedi)
[10:16:29] <wikibugs>	 (03PS5) 10Stevemunene: Add an-presto1006 to presto cluster [puppet] - 10https://gerrit.wikimedia.org/r/861368 (https://phabricator.wikimedia.org/T323783)
[10:17:34] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T318605)', diff saved to https://phabricator.wikimedia.org/P42148 and previous config saved to /var/cache/conftool/dbconfig/20221201-101733-ladsgroup.json
[10:17:35] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1187.eqiad.wmnet with reason: Maintenance
[10:17:37] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[10:17:49] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1187.eqiad.wmnet with reason: Maintenance
[10:17:55] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1187 (T318605)', diff saved to https://phabricator.wikimedia.org/P42149 and previous config saved to /var/cache/conftool/dbconfig/20221201-101754-ladsgroup.json
[10:18:23] <wikibugs>	 (03CR) 10Stevemunene: [C: 03+2] Add an-presto1006 to presto cluster [puppet] - 10https://gerrit.wikimedia.org/r/861368 (https://phabricator.wikimedia.org/T323783) (owner: 10Stevemunene)
[10:19:20] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P42150 and previous config saved to /var/cache/conftool/dbconfig/20221201-101920-ladsgroup.json
[10:20:14] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti5004.eqsin.wmnet
[10:23:18] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review, 10cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (10cmooney) @dcaro also advised me this can be set for the monitor traffic also, with 'mon_use_min_delay_socket':  https://github.com/c...
[10:23:57] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T318605)', diff saved to https://phabricator.wikimedia.org/P42151 and previous config saved to /var/cache/conftool/dbconfig/20221201-102357-ladsgroup.json
[10:24:00] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[10:28:35] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti5004.eqsin.wmnet
[10:29:03] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315', diff saved to https://phabricator.wikimedia.org/P42152 and previous config saved to /var/cache/conftool/dbconfig/20221201-102903-ladsgroup.json
[10:34:03] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti5004.eqsin.wmnet to cluster eqsin and group 1
[10:34:21] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti5004.eqsin.wmnet to cluster eqsin and group 1
[10:34:27] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T323907)', diff saved to https://phabricator.wikimedia.org/P42153 and previous config saved to /var/cache/conftool/dbconfig/20221201-103426-ladsgroup.json
[10:34:28] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1144.eqiad.wmnet with reason: Maintenance
[10:34:30] <stashbot>	 T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907
[10:34:42] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1144.eqiad.wmnet with reason: Maintenance
[10:34:48] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3315 (T323907)', diff saved to https://phabricator.wikimedia.org/P42154 and previous config saved to /var/cache/conftool/dbconfig/20221201-103448-ladsgroup.json
[10:35:07] <wikibugs>	 (03PS1) 10Filippo Giunchedi: hiera: set thanos-web service to production [puppet] - 10https://gerrit.wikimedia.org/r/862843 (https://phabricator.wikimedia.org/T323913)
[10:35:44] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] hiera: set thanos-web service to production [puppet] - 10https://gerrit.wikimedia.org/r/862843 (https://phabricator.wikimedia.org/T323913) (owner: 10Filippo Giunchedi)
[10:36:08] <wikibugs>	 (03CR) 10Clément Goubert: [V: 03+1] "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/862304 (https://phabricator.wikimedia.org/T324091) (owner: 10Clément Goubert)
[10:36:57] <wikibugs>	 (03PS2) 10Filippo Giunchedi: hiera: set thanos-web service to production [puppet] - 10https://gerrit.wikimedia.org/r/862843 (https://phabricator.wikimedia.org/T323913)
[10:37:02] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti5004.eqsin.wmnet
[10:39:04] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P42155 and previous config saved to /var/cache/conftool/dbconfig/20221201-103903-ladsgroup.json
[10:44:00] <wikibugs>	 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10Joe) For the record, we decided to start with option 3 and we're starting with rollout phase 1, specifically we'll move test2.wikipedia.org to kubernetes first.
[10:44:10] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315', diff saved to https://phabricator.wikimedia.org/P42156 and previous config saved to /var/cache/conftool/dbconfig/20221201-104409-ladsgroup.json
[10:44:13] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] "Oncall has been notified" [puppet] - 10https://gerrit.wikimedia.org/r/862843 (https://phabricator.wikimedia.org/T323913) (owner: 10Filippo Giunchedi)
[10:45:19] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: trafficserver: move test2wiki to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/862845 (https://phabricator.wikimedia.org/T290536)
[10:45:49] <wikibugs>	 10SRE, 10SRE-Access-Requests: nahidunlimited with same SSH password for WMCS and production - https://phabricator.wikimedia.org/T324197 (10Marostegui)
[10:45:53] <wikibugs>	 10SRE, 10SRE-Access-Requests: nahidunlimited with same SSH password for WMCS and production - https://phabricator.wikimedia.org/T324197 (10Marostegui) p:05Triage→03High
[10:45:59] <jinxer-wm>	 (KubernetesAPILatency) firing: (13) High Kubernetes API latency (LIST certificates) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:48:07] <wikibugs>	 10SRE, 10SRE-Access-Requests: Turnilo access request for User:wfan - https://phabricator.wikimedia.org/T324057 (10Marostegui) Thanks Greg!. I will wait for the correct template and then proceed.
[10:48:17] <wikibugs>	 (03PS1) 10Filippo Giunchedi: Revert "Revert thanos-web discovery record" [dns] - 10https://gerrit.wikimedia.org/r/862356
[10:52:46] <wikibugs>	 (03PS1) 10Michael Große: Fix broken search with vector-2022 on www.wikidata.org [extensions/Wikibase] (wmf/1.40.0-wmf.12) - 10https://gerrit.wikimedia.org/r/862357 (https://phabricator.wikimedia.org/T324148)
[10:53:01] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] prometheus: move traffic rules to 'ops' instance [puppet] - 10https://gerrit.wikimedia.org/r/861825 (https://phabricator.wikimedia.org/T288196) (owner: 10Filippo Giunchedi)
[10:53:07] <Lucas_WMDE>	 jouncebot: nowandnext
[10:53:07] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 6 minute(s)
[10:53:07] <jouncebot>	 In 0 hour(s) and 6 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221201T1100)
[10:53:49] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] Revert "Revert thanos-web discovery record" [dns] - 10https://gerrit.wikimedia.org/r/862356 (owner: 10Filippo Giunchedi)
[10:53:55] <wikibugs>	 (03PS2) 10Filippo Giunchedi: Revert "Revert thanos-web discovery record" [dns] - 10https://gerrit.wikimedia.org/r/862356
[10:54:03] <Lucas_WMDE>	 does anyone mind if I backport “fix broken search with vector-2022 on www.wikidata.org” (just above) now, without waiting for the afternoon backport window?
[10:54:10] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P42157 and previous config saved to /var/cache/conftool/dbconfig/20221201-105410-ladsgroup.json
[10:55:35] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti5004.eqsin.wmnet
[10:55:45] <_joe_>	 Lucas_WMDE: go ahead
[10:55:52] <Lucas_WMDE>	 ok, thanks
[10:55:59] <jinxer-wm>	 (KubernetesAPILatency) resolved: (13) High Kubernetes API latency (LIST certificates) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:56:19] * MichaelG_WMDE is here as well
[10:56:33] <elukey>	 !log deleted knative controller + net-istio controllers on ml-serve-eqiad to clear out some weird state (causing high latencies for the k8s api)
[10:56:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:57:02] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [extensions/Wikibase] (wmf/1.40.0-wmf.12) - 10https://gerrit.wikimedia.org/r/862357 (https://phabricator.wikimedia.org/T324148) (owner: 10Michael Große)
[10:57:11] <logmsgbot>	 !log filippo@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=thanos-web
[10:57:21] <Lucas_WMDE>	 CI will probably take 15 minutes or so, so there’s plenty of time for someone else to object ;)
[10:59:16] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315 (T323907)', diff saved to https://phabricator.wikimedia.org/P42158 and previous config saved to /var/cache/conftool/dbconfig/20221201-105916-ladsgroup.json
[10:59:18] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2157.codfw.wmnet with reason: Maintenance
[10:59:20] <stashbot>	 T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907
[10:59:31] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2157.codfw.wmnet with reason: Maintenance
[10:59:38] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2157 (T323907)', diff saved to https://phabricator.wikimedia.org/P42159 and previous config saved to /var/cache/conftool/dbconfig/20221201-105938-ladsgroup.json
[11:00:04] <jouncebot>	 mvolz: gettimeofday() says it's time for Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221201T1100)
[11:00:21] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1163.eqiad.wmnet with reason: Maintenance
[11:00:23] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1163.eqiad.wmnet with reason: Maintenance
[11:01:50] <wikibugs>	 (03CR) 10Vgutierrez: [C: 04-1] trafficserver: move test2wiki to kubernetes (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/862845 (https://phabricator.wikimedia.org/T290536) (owner: 10Giuseppe Lavagetto)
[11:02:41] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] APIGW/Liftwing: Fix missing part of path regexen [deployment-charts] - 10https://gerrit.wikimedia.org/r/862311 (https://phabricator.wikimedia.org/T323916) (owner: 10Klausman)
[11:05:07] <wikibugs>	 10Puppet, 10SRE, 10Infrastructure-Foundations: Write, publish and deploy puppet-lint plug-in for ensure attribute bareword check - https://phabricator.wikimedia.org/T95377 (10jbond) >>! In T95377#8433934, @Dzahn wrote: > @jbond and all. I wonder what you would think about this now in 2022. >  > Are the barew...
[11:05:37] <wikibugs>	 10SRE-OnFire, 10observability, 10serviceops-radar, 10Sustainability (Incident Followup): Monitor high load on etcd/conf* hosts to prevent incidents of software requiring config reload too often - https://phabricator.wikimedia.org/T322400 (10JMeybohm)
[11:07:27] <wikibugs>	 (03Merged) 10jenkins-bot: APIGW/Liftwing: Fix missing part of path regexen [deployment-charts] - 10https://gerrit.wikimedia.org/r/862311 (https://phabricator.wikimedia.org/T323916) (owner: 10Klausman)
[11:09:17] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T318605)', diff saved to https://phabricator.wikimedia.org/P42160 and previous config saved to /var/cache/conftool/dbconfig/20221201-110916-ladsgroup.json
[11:09:18] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1201.eqiad.wmnet with reason: Maintenance
[11:09:20] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[11:09:31] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1201.eqiad.wmnet with reason: Maintenance
[11:09:38] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1201 (T318605)', diff saved to https://phabricator.wikimedia.org/P42161 and previous config saved to /var/cache/conftool/dbconfig/20221201-110938-ladsgroup.json
[11:10:41] <wikibugs>	 (03Merged) 10jenkins-bot: Fix broken search with vector-2022 on www.wikidata.org [extensions/Wikibase] (wmf/1.40.0-wmf.12) - 10https://gerrit.wikimedia.org/r/862357 (https://phabricator.wikimedia.org/T324148) (owner: 10Michael Große)
[11:11:07] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:862357|Fix broken search with vector-2022 on www.wikidata.org (T324148)]]
[11:11:10] <stashbot>	 T324148: search box not working with Vector 2022 on Wikidata - https://phabricator.wikimedia.org/T324148
[11:12:15] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde and migr: Backport for [[gerrit:862357|Fix broken search with vector-2022 on www.wikidata.org (T324148)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet
[11:12:22] <Lucas_WMDE>	 MichaelG_WMDE: ^
[11:12:26] <Lucas_WMDE>	 testing on mwdebug
[11:12:51] * MichaelG_WMDE looks
[11:12:53] <Lucas_WMDE>	 seems to work fine as far as I can tell
[11:13:01] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] "looking good, see comments about x-wikimedia-debug" [puppet] - 10https://gerrit.wikimedia.org/r/862845 (https://phabricator.wikimedia.org/T290536) (owner: 10Giuseppe Lavagetto)
[11:13:34] <MichaelG_WMDE>	 @Lucas_WMDE same!
[11:13:39] <Lucas_WMDE>	 ok, syncing
[11:14:18] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[11:15:00] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[11:15:01] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[11:15:43] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1201 (T318605)', diff saved to https://phabricator.wikimedia.org/P42162 and previous config saved to /var/cache/conftool/dbconfig/20221201-111542-ladsgroup.json
[11:15:46] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[11:15:49] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[11:16:05] <wikibugs>	 (03CR) 10Ilias Sarantopoulos: "This change is ready for review." (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/861345 (https://phabricator.wikimedia.org/T323624) (owner: 10Ilias Sarantopoulos)
[11:16:18] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] "LGTM, one optional request" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/862247 (https://phabricator.wikimedia.org/T321282) (owner: 10Guergana Tzatchkova)
[11:16:52] <wikibugs>	 (03PS2) 10Ilias Sarantopoulos: enable multi-processing for ml-staging revscoring-editquality-goodfaith model [deployment-charts] - 10https://gerrit.wikimedia.org/r/861345 (https://phabricator.wikimedia.org/T323624)
[11:17:29] <wikibugs>	 (03PS4) 10David Caro: wmcs: add cookbook to create a project [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/862229 (https://phabricator.wikimedia.org/T323952)
[11:18:03] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:862357|Fix broken search with vector-2022 on www.wikidata.org (T324148)]] (duration: 06m 56s)
[11:18:07] <stashbot>	 T324148: search box not working with Vector 2022 on Wikidata - https://phabricator.wikimedia.org/T324148
[11:18:35] <Lucas_WMDE>	 looks like it’s working without mwdebug now
[11:18:36] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] Rewrite as kubernetes operator/controller (035 comments) [software/helm-state-metrics] - 10https://gerrit.wikimedia.org/r/861352 (https://phabricator.wikimedia.org/T323706) (owner: 10JMeybohm)
[11:18:36] * Lucas_WMDE done
[11:18:50] <wikibugs>	 (03PS1) 10Jbond: do not merge: test CI [puppet] - 10https://gerrit.wikimedia.org/r/862848 (https://phabricator.wikimedia.org/T95377)
[11:19:26] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] do not merge: test CI [puppet] - 10https://gerrit.wikimedia.org/r/862848 (https://phabricator.wikimedia.org/T95377) (owner: 10Jbond)
[11:20:43] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] wmcs: add cookbook to create a project [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/862229 (https://phabricator.wikimedia.org/T323952) (owner: 10David Caro)
[11:20:53] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[11:21:28] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[11:21:28] <wikibugs>	 (03CR) 10Elukey: enable multi-processing for ml-staging revscoring-editquality-goodfaith model (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/861345 (https://phabricator.wikimedia.org/T323624) (owner: 10Ilias Sarantopoulos)
[11:21:29] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[11:21:40] <wikibugs>	 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Write, publish and deploy puppet-lint plug-in for ensure attribute bareword check - https://phabricator.wikimedia.org/T95377 (10jbond) >>! In T95377#8435033, @jbond wrote: >> Or... does the fact that we haven't done it since 2015 just show...
[11:25:08] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[11:25:57] <wikibugs>	 10SRE, 10Fundraising-Backlog, 10Traffic-Icebox, 10fr-donorservices, and 2 others: SSL cert for links.email.wikimedia.org - https://phabricator.wikimedia.org/T188561 (10Vgutierrez) I can confirm that they've added HSTS support and stopped serving traffic in port 80 and redirect it to port 443: ` $ curl -I l...
[11:26:13] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+1] kube-env: Move environments and services to config [puppet] - 10https://gerrit.wikimedia.org/r/862304 (https://phabricator.wikimedia.org/T324091) (owner: 10Clément Goubert)
[11:26:28] <wikibugs>	 (03PS5) 10David Caro: wmcs: add cookbook to create a project [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/862229 (https://phabricator.wikimedia.org/T323952)
[11:27:04] <wikibugs>	 (03CR) 10Clément Goubert: [V: 03+1 C: 03+2] kube-env: Move environments and services to config [puppet] - 10https://gerrit.wikimedia.org/r/862304 (https://phabricator.wikimedia.org/T324091) (owner: 10Clément Goubert)
[11:30:49] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1201', diff saved to https://phabricator.wikimedia.org/P42163 and previous config saved to /var/cache/conftool/dbconfig/20221201-113049-ladsgroup.json
[11:32:56] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti5004.eqsin.wmnet
[11:35:51] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 200 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[11:37:18] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review, 10cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (10BTullis) @cmooney - This looks very useful. We can certainly look at using `osd_heartbeat_use_min_delay_socket=true` from the outset...
[11:37:55] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 1 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[11:38:59] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:40:37] <volans>	 _joe_: I think this is known right? anything to do about it? ^^^
[11:41:29] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti5004.eqsin.wmnet
[11:42:29] <wikibugs>	 (03PS1) 10Hnowlan: api-gateway: Correct behaviour when handling multiple regexes in path [deployment-charts] - 10https://gerrit.wikimedia.org/r/862851 (https://phabricator.wikimedia.org/T317326)
[11:42:46] <wikibugs>	 (03CR) 10Jbond: spicerack: add monitoring for sre.puppet.netbox-sync (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/860019 (owner: 10Jbond)
[11:43:30] <_joe_>	 volans: the latency? yes, you can blame jayme 
[11:43:37] <volans>	 yes
[11:43:39] <_joe_>	 it's his software causing it
[11:43:44] <volans>	 :D
[11:44:08] <wikibugs>	 (03CR) 10Vgutierrez: "dstat --varnish-hit currently works as expected. Ema took care of varnish-be missing in 6c89146a832b0290f00de9123e8531dd6e71b600. Do we ha" [puppet] - 10https://gerrit.wikimedia.org/r/862371 (owner: 10BCornwall)
[11:44:48] <wikibugs>	 (03Abandoned) 10Jbond: do not merge: test CI [puppet] - 10https://gerrit.wikimedia.org/r/862848 (https://phabricator.wikimedia.org/T95377) (owner: 10Jbond)
[11:45:56] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1201', diff saved to https://phabricator.wikimedia.org/P42164 and previous config saved to /var/cache/conftool/dbconfig/20221201-114555-ladsgroup.json
[11:46:41] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti5004.eqsin.wmnet to cluster eqsin and group 1
[11:47:33] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti5004.eqsin.wmnet to cluster eqsin and group 1
[11:48:59] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:50:01] <wikibugs>	 (03CR) 10Vgutierrez: "on the other hand, dstat --varnishstat is currently broken and it needs varnish-be references to be removed to fix it" [puppet] - 10https://gerrit.wikimedia.org/r/862371 (owner: 10BCornwall)
[11:50:05] <wikibugs>	 10SRE, 10SRE-Access-Requests: nahidunlimited with same SSH password for WMCS and production - https://phabricator.wikimedia.org/T324197 (10Marostegui)
[11:50:45] <wikibugs>	 (03CR) 10FNegri: WIP: idea for cloud cumin::target (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/861855 (https://phabricator.wikimedia.org/T323483) (owner: 10Jbond)
[11:50:53] <wikibugs>	 (03CR) 10Klausman: [C: 03+1] api-gateway: Correct behaviour when handling multiple regexes in path [deployment-charts] - 10https://gerrit.wikimedia.org/r/862851 (https://phabricator.wikimedia.org/T317326) (owner: 10Hnowlan)
[11:51:29] <wikibugs>	 (03CR) 10FNegri: [C: 03+1] wmcs: add cookbook to create a project [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/862229 (https://phabricator.wikimedia.org/T323952) (owner: 10David Caro)
[11:53:31] <claime>	 volans: You can ack it for another week if it's too noisy (the LIST services latency)
[11:53:41] <wikibugs>	 (03CR) 10FNegri: "Can you add an example of a use case that is fixed by this patch?" [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/862294 (owner: 10David Caro)
[11:53:47] <claime>	 Or I can do it idc
[11:54:44] <claime>	 Heh, it's silenced but only for eqiad
[11:54:46] <claime>	 Fixing
[11:55:45] <claime>	 Done, I bumped the silence another 24h and removed the site matcher
[11:55:48] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti5004.eqsin.wmnet to cluster eqsin and group 1
[11:56:13] <wikibugs>	 (03CR) 10Vgutierrez: "and as a side effect if this CR gets merged, T277910 could be closed" [puppet] - 10https://gerrit.wikimedia.org/r/862371 (owner: 10BCornwall)
[11:57:15] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti5004.eqsin.wmnet to cluster eqsin and group 1
[11:57:34] <wikibugs>	 (03PS1) 10Jbond: CI - puppet-lint: Add puppet-lint-param-docs [puppet] - 10https://gerrit.wikimedia.org/r/862855 (https://phabricator.wikimedia.org/T127797)
[11:57:36] <wikibugs>	 (03PS1) 10Jbond: do not merge: test puppet-lint-param-docs [puppet] - 10https://gerrit.wikimedia.org/r/862856 (https://phabricator.wikimedia.org/T127797)
[11:57:38] <wikibugs>	 (03PS1) 10Jbond: do not merge: CI should no longer complain [puppet] - 10https://gerrit.wikimedia.org/r/862857
[11:58:45] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] do not merge: test puppet-lint-param-docs [puppet] - 10https://gerrit.wikimedia.org/r/862856 (https://phabricator.wikimedia.org/T127797) (owner: 10Jbond)
[11:59:12] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] do not merge: CI should no longer complain [puppet] - 10https://gerrit.wikimedia.org/r/862857 (owner: 10Jbond)
[12:01:02] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1201 (T318605)', diff saved to https://phabricator.wikimedia.org/P42165 and previous config saved to /var/cache/conftool/dbconfig/20221201-120102-ladsgroup.json
[12:01:04] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance
[12:01:06] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[12:01:17] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance
[12:01:34] <wikibugs>	 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Documentation, and 2 others: document all puppet classes / defined types!? - https://phabricator.wikimedia.org/T127797 (10jbond) I would be +1 for joe's suggestion.  The above patches add this and show what the output looks like.    > This way we don't force...
[12:02:40] <wikibugs>	 10SRE, 10Analytics-Clusters, 10Analytics-Radar, 10serviceops: Consider Julie for managing Kafka settings, perhaps even integrating with Event Stream Config - https://phabricator.wikimedia.org/T276088 (10LSobanski)
[12:03:28] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] postgresql: Add bookworm support [puppet] - 10https://gerrit.wikimedia.org/r/862260 (https://phabricator.wikimedia.org/T321783) (owner: 10Muehlenhoff)
[12:04:01] <wikibugs>	 10SRE, 10SRE-Access-Requests: nahidunlimited with same SSH password for WMCS and production - https://phabricator.wikimedia.org/T324197 (10Nahid) Apologies for the trouble. I forgot that I needed two keys. Here's the new one:    ` ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAACAQC0YqFoY0FU966/s5yB2JSzP3II2kGfSH5sMLfa5xrt...
[12:07:57] <wikibugs>	 (03PS1) 10Marostegui: data.yaml: Replace nahidunlimited key [puppet] - 10https://gerrit.wikimedia.org/r/862860 (https://phabricator.wikimedia.org/T324197)
[12:09:17] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: nahidunlimited with same SSH password for WMCS and production - https://phabricator.wikimedia.org/T324197 (10Marostegui) a:05Nahid→03Marostegui
[12:11:39] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "Syntax LGTM, I trust you got the key via a secure method ;)" [puppet] - 10https://gerrit.wikimedia.org/r/862860 (https://phabricator.wikimedia.org/T324197) (owner: 10Marostegui)
[12:11:56] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+1] data.yaml: Replace nahidunlimited key [puppet] - 10https://gerrit.wikimedia.org/r/862860 (https://phabricator.wikimedia.org/T324197) (owner: 10Marostegui)
[12:13:02] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T323907)', diff saved to https://phabricator.wikimedia.org/P42166 and previous config saved to /var/cache/conftool/dbconfig/20221201-121301-ladsgroup.json
[12:13:05] <stashbot>	 T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907
[12:14:53] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] "It is now verified" [puppet] - 10https://gerrit.wikimedia.org/r/862860 (https://phabricator.wikimedia.org/T324197) (owner: 10Marostegui)
[12:15:56] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: nahidunlimited with same SSH password for WMCS and production - https://phabricator.wikimedia.org/T324197 (10Marostegui) 05Open→03Resolved ssh key verified and replaced. Please allow 30-60 minutes for it to totally spread across the fleet.
[12:24:57] <icinga-wm>	 PROBLEM - Check systemd state on an-presto1006 is CRITICAL: CRITICAL - degraded: The following units failed: presto-server.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:27:31] <icinga-wm>	 ACKNOWLEDGEMENT - Check systemd state on an-presto1006 is CRITICAL: CRITICAL - degraded: The following units failed: presto-server.service Btullis T323783 - host being brought into service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:28:08] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P42167 and previous config saved to /var/cache/conftool/dbconfig/20221201-122807-ladsgroup.json
[12:29:44] <wikibugs>	 (03PS5) 10Guergana Tzatchkova: Add Property (120) to Wikidata content Namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/862247 (https://phabricator.wikimedia.org/T321282)
[12:30:02] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Add Hiera settings for second bookworm puppetdb pair [puppet] - 10https://gerrit.wikimedia.org/r/862256 (https://phabricator.wikimedia.org/T321783) (owner: 10Muehlenhoff)
[12:30:15] <wikibugs>	 (03CR) 10Guergana Tzatchkova: Add Property (120) to Wikidata content Namespace (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/862247 (https://phabricator.wikimedia.org/T321282) (owner: 10Guergana Tzatchkova)
[12:31:13] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] Add Property (120) to Wikidata content Namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/862247 (https://phabricator.wikimedia.org/T321282) (owner: 10Guergana Tzatchkova)
[12:34:09] <wikibugs>	 (03PS4) 10Jbond: cumin::target: idea for cloud cumin::target [puppet] - 10https://gerrit.wikimedia.org/r/861855 (https://phabricator.wikimedia.org/T323483)
[12:34:30] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T323907)', diff saved to https://phabricator.wikimedia.org/P42168 and previous config saved to /var/cache/conftool/dbconfig/20221201-123430-ladsgroup.json
[12:34:35] <stashbot>	 T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907
[12:34:48] <wikibugs>	 (03PS5) 10Jbond: cumin::target: idea for cloud cumin::target [puppet] - 10https://gerrit.wikimedia.org/r/861855 (https://phabricator.wikimedia.org/T323483)
[12:35:29] <wikibugs>	 (03CR) 10Jbond: cumin::target: idea for cloud cumin::target (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/861855 (https://phabricator.wikimedia.org/T323483) (owner: 10Jbond)
[12:36:58] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] cumin::target: idea for cloud cumin::target [puppet] - 10https://gerrit.wikimedia.org/r/861855 (https://phabricator.wikimedia.org/T323483) (owner: 10Jbond)
[12:37:29] <wikibugs>	 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (10MoritzMuehlenhoff) ganeti5004 has been added to the eqsin Ganeti cluster.
[12:39:00] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] api-gateway: Correct behaviour when handling multiple regexes in path [deployment-charts] - 10https://gerrit.wikimedia.org/r/862851 (https://phabricator.wikimedia.org/T317326) (owner: 10Hnowlan)
[12:40:33] <wikibugs>	 (03CR) 10Thiemo Kreuz (WMDE): "I'm not able to do an actual review on this, sorry. Just some questions. This is meant to be manually executed, right? How does one know w" [puppet] - 10https://gerrit.wikimedia.org/r/752748 (https://phabricator.wikimedia.org/T218097) (owner: 10MSantos)
[12:42:54] <wikibugs>	 (03Abandoned) 10Hnowlan: thumbor: correct tinyrgb path [deployment-charts] - 10https://gerrit.wikimedia.org/r/860628 (https://phabricator.wikimedia.org/T323775) (owner: 10Hnowlan)
[12:43:15] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P42169 and previous config saved to /var/cache/conftool/dbconfig/20221201-124314-ladsgroup.json
[12:43:42] <moritzm>	 !log installing glibc security updates on buster
[12:43:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:44:19] <wikibugs>	 (03Merged) 10jenkins-bot: api-gateway: Correct behaviour when handling multiple regexes in path [deployment-charts] - 10https://gerrit.wikimedia.org/r/862851 (https://phabricator.wikimedia.org/T317326) (owner: 10Hnowlan)
[12:45:48] <wikibugs>	 (03CR) 10Raymond Ndibe: [C: 03+2] cookbooks: print out instructions on next step after updating the buildpack/tekton images in the local repository [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/859582 (https://phabricator.wikimedia.org/T321188) (owner: 10Raymond Ndibe)
[12:46:56] <icinga-wm>	 PROBLEM - puppet last run on idp-test1002 is CRITICAL: CRITICAL: Puppet has been disabled for 604896 seconds, message: jmm test, last run 7 days ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[12:47:39] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/api-gateway: sync
[12:47:57] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/api-gateway: sync
[12:48:16] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/api-gateway: sync
[12:48:45] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] cookbooks: print out instructions on next step after updating the buildpack/tekton images in the local repository [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/859582 (https://phabricator.wikimedia.org/T321188) (owner: 10Raymond Ndibe)
[12:48:47] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/api-gateway: sync
[12:49:37] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P42170 and previous config saved to /var/cache/conftool/dbconfig/20221201-124936-ladsgroup.json
[12:49:40] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/api-gateway: sync
[12:50:01] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: sync
[12:50:22] <wikibugs>	 (03CR) 10Filippo Giunchedi: "Very cool! Thank you for the quick action on this" [puppet] - 10https://gerrit.wikimedia.org/r/862304 (https://phabricator.wikimedia.org/T324091) (owner: 10Clément Goubert)
[12:50:26] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/api-gateway: sync
[12:50:30] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: sync
[12:52:26] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[12:53:07] <wikibugs>	 (03PS1) 10Muehlenhoff: superset: Enable profile::auto_restarts::service for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/862883 (https://phabricator.wikimedia.org/T135991)
[12:55:09] <wikibugs>	 (03PS2) 10Jbond: idp::standalone: add django oidc app [puppet] - 10https://gerrit.wikimedia.org/r/862328
[12:55:47] <wikibugs>	 (03PS1) 10Muehlenhoff: hue: Enable profile::auto_restarts::service for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/862885 (https://phabricator.wikimedia.org/T135991)
[12:57:16] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] idp::standalone: add django oidc app [puppet] - 10https://gerrit.wikimedia.org/r/862328 (owner: 10Jbond)
[12:58:21] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T323907)', diff saved to https://phabricator.wikimedia.org/P42171 and previous config saved to /var/cache/conftool/dbconfig/20221201-125821-ladsgroup.json
[12:58:23] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[12:58:25] <stashbot>	 T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907
[12:58:36] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[13:00:47] <wikibugs>	 (03PS1) 10Muehlenhoff: yarn: Enable profile::auto_restarts::service for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/862886 (https://phabricator.wikimedia.org/T135991)
[13:04:19] <wikibugs>	 (03CR) 10FNegri: "Thanks John, looks good to me! Are you gonna create the private key and store it in the private repo, or do you want me to do it?" [puppet] - 10https://gerrit.wikimedia.org/r/861855 (https://phabricator.wikimedia.org/T323483) (owner: 10Jbond)
[13:04:43] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P42172 and previous config saved to /var/cache/conftool/dbconfig/20221201-130443-ladsgroup.json
[13:06:28] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Request for access to analytics-platform-eng-admins for mlitn - https://phabricator.wikimedia.org/T324101 (10matthiasmullie) >>! In T324101#8432262, @Ottomata wrote: >> Reason for access: need query search usage via jupyter for Structured Data pipelines > I'm...
[13:06:50] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Request for access to analytics-platform-eng-admins for mlitn - https://phabricator.wikimedia.org/T324101 (10matthiasmullie)
[13:06:57] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Request for access to analytics-platform-eng-admins for mlitn - https://phabricator.wikimedia.org/T324101 (10matthiasmullie)
[13:07:19] <wikibugs>	 (03PS3) 10Jbond: idp::standalone: add django oidc app [puppet] - 10https://gerrit.wikimedia.org/r/862328
[13:09:11] <wikibugs>	 (03CR) 10Jbond: cumin::target: idea for cloud cumin::target (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/861855 (https://phabricator.wikimedia.org/T323483) (owner: 10Jbond)
[13:09:17] <wikibugs>	 (03CR) 10Jbond: [C: 04-1] cumin::target: idea for cloud cumin::target [puppet] - 10https://gerrit.wikimedia.org/r/861855 (https://phabricator.wikimedia.org/T323483) (owner: 10Jbond)
[13:14:36] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 148 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[13:16:36] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 1 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[13:19:42] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Request for access to analytics-platform-eng-admins for mlitn - https://phabricator.wikimedia.org/T324101 (10Ottomata) Ah okay, approved for analytics-privatedata-users and ssh and kerberos then.  You can use this ticket for the Kerberos access too.
[13:19:50] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T323907)', diff saved to https://phabricator.wikimedia.org/P42174 and previous config saved to /var/cache/conftool/dbconfig/20221201-131950-ladsgroup.json
[13:19:52] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2171.codfw.wmnet with reason: Maintenance
[13:19:54] <stashbot>	 T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907
[13:19:54] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2171.codfw.wmnet with reason: Maintenance
[13:20:01] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2171:3315 (T323907)', diff saved to https://phabricator.wikimedia.org/P42175 and previous config saved to /var/cache/conftool/dbconfig/20221201-132000-ladsgroup.json
[13:24:20] <wikibugs>	 10SRE, 10Analytics-Clusters, 10Analytics-Radar, 10Data-Engineering, and 2 others: Consider Julie for managing Kafka settings, perhaps even integrating with Event Stream Config - https://phabricator.wikimedia.org/T276088 (10Ottomata)
[13:28:32] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[13:28:54] <wikibugs>	 (03PS1) 10Jaime Nuche: scap.cfg: enable K8s deployments in prod cluster [puppet] - 10https://gerrit.wikimedia.org/r/862892
[13:28:57] <wikibugs>	 (03Abandoned) 10Dzahn: scap: move firewall rules out of the module [puppet] - 10https://gerrit.wikimedia.org/r/862378 (https://phabricator.wikimedia.org/T114209) (owner: 10Dzahn)
[13:30:37] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Adjust DNS for LVS eqsin. - cmooney@cumin1001"
[13:38:44] <wikibugs>	 (03PS1) 10Volans: Revert "setup.py: add temporary upper limit for pylint" [cookbooks] - 10https://gerrit.wikimedia.org/r/862364
[13:39:57] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Revert "setup.py: add temporary upper limit for pylint" [cookbooks] - 10https://gerrit.wikimedia.org/r/862364 (owner: 10Volans)
[13:43:39] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] idp::standalone: add django oidc app [puppet] - 10https://gerrit.wikimedia.org/r/862328 (owner: 10Jbond)
[13:43:56] <wikibugs>	 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T319126 (10phaultfinder)
[13:49:21] <wikibugs>	 10SRE-Access-Requests, 10DBA: mariadb: grant user 'phstats' additional select on phabricator_search db - https://phabricator.wikimedia.org/T324205 (10Aklapper) p:05Triage→03Low
[13:50:16] <wikibugs>	 (03PS1) 10Aklapper: mariadb: grant user 'phstats' additional select on phabricator_search db [puppet] - 10https://gerrit.wikimedia.org/r/862895 (https://phabricator.wikimedia.org/T324205)
[13:53:26] <wikibugs>	 (03PS3) 10Elukey: ml-services: enable multi-processing for ml-staging editquality-goodfaith [deployment-charts] - 10https://gerrit.wikimedia.org/r/861345 (https://phabricator.wikimedia.org/T323624) (owner: 10Ilias Sarantopoulos)
[13:58:51] <wikibugs>	 (03CR) 10Elukey: ml-services: enable multi-processing for ml-staging editquality-goodfaith (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/861345 (https://phabricator.wikimedia.org/T323624) (owner: 10Ilias Sarantopoulos)
[13:58:58] <wikibugs>	 (03CR) 10Elukey: ml-services: enable multi-processing for ml-staging editquality-goodfaith (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/861345 (https://phabricator.wikimedia.org/T323624) (owner: 10Ilias Sarantopoulos)
[13:59:18] <wikibugs>	 (03PS2) 10Kosta Harlan: GrowthExperiments: Enable new impact module on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/862839 (https://phabricator.wikimedia.org/T323526)
[13:59:27] <wikibugs>	 (03PS8) 10Kosta Harlan: GrowthExperiments: End imagerecommendation experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/859991 (https://phabricator.wikimedia.org/T323686)
[13:59:33] <wikibugs>	 (03PS2) 10Kosta Harlan: GrowthExperiments: Enable new impact module on pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/862840 (https://phabricator.wikimedia.org/T323686)
[13:59:36] <wikibugs>	 (03PS7) 10Kosta Harlan: GrowthExperiments: Start oldimpact experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/860867 (https://phabricator.wikimedia.org/T323526)
[14:00:04] <jouncebot>	 Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221201T1400)
[14:00:04] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, and awight: #bothumor I � Unicode. All rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221201T1400).
[14:00:04] <jouncebot>	 Sohom_Datta and kostajh: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[14:00:13] <Lucas_WMDE>	 o/
[14:00:14] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Adjust DNS for LVS eqsin. - cmooney@cumin1001"
[14:00:14] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:00:21] <kostajh>	 hi
[14:00:42] <wikibugs>	 (03PS5) 10Eevans: Promote the aqs_next role to be aqs [puppet] - 10https://gerrit.wikimedia.org/r/859059 (https://phabricator.wikimedia.org/T302278) (owner: 10Btullis)
[14:01:08] <wikibugs>	 (03CR) 10Volans: "And ofc there is a new issue with the new release, opened https://github.com/PyCQA/prospector/issues/545 upstream" [cookbooks] - 10https://gerrit.wikimedia.org/r/862364 (owner: 10Volans)
[14:01:17] <kostajh>	 I can self-serve my deploys
[14:01:28] <Lucas_WMDE>	 sure, go ahead
[14:01:56] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by kharlan@deploy1002 using scap backport" [extensions/GrowthExperiments] (wmf/1.40.0-wmf.12) - https://gerrit.wikimedia.org/r/862355 (https://phabricator.wikimedia.org/T324188) (owner: Kosta Harlan)
[14:12:40] <wikibugs>	 (CR) Eevans: [C: +2] Promote the aqs_next role to be aqs [puppet] - https://gerrit.wikimedia.org/r/859059 (https://phabricator.wikimedia.org/T302278) (owner: Btullis)
[14:12:50] <wikibugs>	 (PS2) BBlack: docker_registry_ha: remove unused cache::nodes ref [puppet] - https://gerrit.wikimedia.org/r/861463 (https://phabricator.wikimedia.org/T256762)
[14:12:53] <wikibugs>	 (PS1) BBlack: Remove some old utility files [puppet] - https://gerrit.wikimedia.org/r/862903
[14:13:19] <wikibugs>	 (PS2) BBlack: Remove some old utility files [puppet] - https://gerrit.wikimedia.org/r/862903
[14:13:32] <wikibugs>	 SRE, Analytics-Clusters, Analytics-Radar, Data-Engineering-Planning, and 2 others: Consider Julie for managing Kafka settings, perhaps even integrating with Event Stream Config - https://phabricator.wikimedia.org/T276088 (EChetty)
[14:14:54] <wikibugs>	 (CR) BBlack: [C: +2] Remove some old utility files [puppet] - https://gerrit.wikimedia.org/r/862903 (owner: BBlack)
[14:15:53] <wikibugs>	 (Abandoned) BBlack: cipher_sim.py: Port to Python 3 [puppet] - https://gerrit.wikimedia.org/r/670985 (https://phabricator.wikimedia.org/T247364) (owner: CRusnov)
[14:16:41] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Patch-For-Review, cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (Jclark-ctr) @Papaul  they are R440 we are missing configG in netbox
[14:18:17] <wikibugs>	 (PS1) Jbond: P:idp::standalone: update requirements and add second vhost [puppet] - https://gerrit.wikimedia.org/r/862927
[14:19:16] <wikibugs>	 Puppet, SRE, Infrastructure-Foundations: Ensure that there are no firewall rules in modules - https://phabricator.wikimedia.org/T114209 (MoritzMuehlenhoff)
[14:19:55] <wikibugs>	 Puppet, SRE, Infrastructure-Foundations: Ensure that there are no firewall rules in modules - https://phabricator.wikimedia.org/T114209 (MoritzMuehlenhoff)
[14:20:01] <wikibugs>	 (Merged) jenkins-bot: DatabaseUserImpactStore: Fix parameter style for upsert keys [extensions/GrowthExperiments] (wmf/1.40.0-wmf.12) - https://gerrit.wikimedia.org/r/862355 (https://phabricator.wikimedia.org/T324188) (owner: Kosta Harlan)
[14:20:28] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:20:29] <logmsgbot>	 !log kharlan@deploy1002 Started scap: Backport for [[gerrit:862355|DatabaseUserImpactStore: Fix parameter style for upsert keys (T324188)]]
[14:20:32] <stashbot>	 T324188: Wikimedia\Rdbms\Platform\SQLPlatform::normalizeUpsertKeys called with deprecated parameter style: the unique key array should be a string or array of string arrays - https://phabricator.wikimedia.org/T324188
[14:20:37] <wikibugs>	 (PS6) Giuseppe Lavagetto: sre.switchdc.mediawiki: adapt to a/a mediawiki [cookbooks] - https://gerrit.wikimedia.org/r/836729
[14:21:38] <logmsgbot>	 !log kharlan@deploy1002 kharlan and kharlan: Backport for [[gerrit:862355|DatabaseUserImpactStore: Fix parameter style for upsert keys (T324188)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet
[14:21:46] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:22:15] <kostajh>	 reviewing on mwdebug1002
[14:22:22] <wikibugs>	 (CR) CI reject: [V: -1] sre.switchdc.mediawiki: adapt to a/a mediawiki [cookbooks] - https://gerrit.wikimedia.org/r/836729 (owner: Giuseppe Lavagetto)
[14:22:42] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[14:23:03] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Request for access to analytics-platform-eng-admins for mlitn - https://phabricator.wikimedia.org/T324101 (EChetty)
[14:23:21] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[14:23:22] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[14:24:07] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[14:24:35] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Request for access to analytics-platform-eng-admins for mlitn - https://phabricator.wikimedia.org/T324101 (EChetty) Merging the Kerberos request into this ticket.
[14:25:08] <wikibugs>	 (CR) Jbond: [C: +2] P:idp::standalone: update requirements and add second vhost [puppet] - https://gerrit.wikimedia.org/r/862927 (owner: Jbond)
[14:25:50] <icinga-wm>	 PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[14:26:59] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1161.eqiad.wmnet with reason: Maintenance
[14:27:13] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1161.eqiad.wmnet with reason: Maintenance
[14:27:14] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[14:27:20] <wikibugs>	 (CR) JMeybohm: [C: +1] admin_ng: set thumbor max memory limit higher [deployment-charts] - https://gerrit.wikimedia.org/r/862230 (https://phabricator.wikimedia.org/T233196) (owner: Hnowlan)
[14:27:29] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[14:27:32] <wikibugs>	 (PS7) Giuseppe Lavagetto: sre.switchdc.mediawiki: adapt to a/a mediawiki [cookbooks] - https://gerrit.wikimedia.org/r/836729
[14:27:36] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1161 (T323907)', diff saved to https://phabricator.wikimedia.org/P42176 and previous config saved to /var/cache/conftool/dbconfig/20221201-142735-ladsgroup.json
[14:27:39] <stashbot>	 T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907
[14:27:54] <logmsgbot>	 !log kharlan@deploy1002 Finished scap: Backport for [[gerrit:862355|DatabaseUserImpactStore: Fix parameter style for upsert keys (T324188)]] (duration: 07m 25s)
[14:27:57] <stashbot>	 T324188: Wikimedia\Rdbms\Platform\SQLPlatform::normalizeUpsertKeys called with deprecated parameter style: the unique key array should be a string or array of string arrays - https://phabricator.wikimedia.org/T324188
[14:28:10] <kostajh>	 alright, on to the next one
[14:28:21] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by kharlan@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/861506 (https://phabricator.wikimedia.org/T318854) (owner: Gergő Tisza)
[14:28:22] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8646 bytes in 0.297 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:28:51] <wikibugs>	 (PS2) Kosta Harlan: [no-op] GrowthExperiments: Enable D3 in production [mediawiki-config] - https://gerrit.wikimedia.org/r/861506 (https://phabricator.wikimedia.org/T318854) (owner: Gergő Tisza)
[14:28:55] <wikibugs>	 (CR) TrainBranchBot: "Approved by kharlan@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/861506 (https://phabricator.wikimedia.org/T318854) (owner: Gergő Tisza)
[14:29:16] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[14:29:28] <wikibugs>	 (CR) CI reject: [V: -1] sre.switchdc.mediawiki: adapt to a/a mediawiki [cookbooks] - https://gerrit.wikimedia.org/r/836729 (owner: Giuseppe Lavagetto)
[14:29:40] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49121 bytes in 0.052 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:29:50] <icinga-wm>	 RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST
[14:29:50] <wikibugs>	 (Merged) jenkins-bot: [no-op] GrowthExperiments: Enable D3 in production [mediawiki-config] - https://gerrit.wikimedia.org/r/861506 (https://phabricator.wikimedia.org/T318854) (owner: Gergő Tisza)
[14:29:57] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[14:29:58] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[14:29:58] <wikibugs>	 (CR) David Caro: [C: +2] wmcs: add cookbook to create a project [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/862229 (https://phabricator.wikimedia.org/T323952) (owner: David Caro)
[14:30:13] <logmsgbot>	 !log kharlan@deploy1002 Started scap: Backport for [[gerrit:861506|[no-op] GrowthExperiments: Enable D3 in production (T318854)]]
[14:30:16] <stashbot>	 T318854: Application Security Review Request : d3.js - https://phabricator.wikimedia.org/T318854
[14:30:16] <stashbot>	 D3: test - ignore - https://phabricator.wikimedia.org/D3
[14:30:36] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[14:31:19] <logmsgbot>	 !log kharlan@deploy1002 kharlan and tgr: Backport for [[gerrit:861506|[no-op] GrowthExperiments: Enable D3 in production (T318854)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet
[14:32:07] <wikibugs>	 (PS3) Kosta Harlan: GrowthExperiments: Enable new impact module on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/862839 (https://phabricator.wikimedia.org/T323526)
[14:32:37] <kostajh>	 syncing
[14:32:44] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] prometheus: move traffic rules to 'ops' instance [puppet] - https://gerrit.wikimedia.org/r/861825 (https://phabricator.wikimedia.org/T288196) (owner: Filippo Giunchedi)
[14:33:39] <wikibugs>	 (CR) CI reject: [V: -1] wmcs: add cookbook to create a project [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/862229 (https://phabricator.wikimedia.org/T323952) (owner: David Caro)
[14:35:39] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[14:36:17] <logmsgbot>	 !log kharlan@deploy1002 Finished scap: Backport for [[gerrit:861506|[no-op] GrowthExperiments: Enable D3 in production (T318854)]] (duration: 06m 04s)
[14:36:20] <stashbot>	 T318854: Application Security Review Request : d3.js - https://phabricator.wikimedia.org/T318854
[14:36:20] <stashbot>	 D3: test - ignore - https://phabricator.wikimedia.org/D3
[14:36:27] <kostajh>	 on to the last one
[14:36:57] <wikibugs>	 (PS1) Muehlenhoff: Enable profile::auto_restarts::service for Superset [puppet] - https://gerrit.wikimedia.org/r/862933 (https://phabricator.wikimedia.org/T135991)
[14:37:31] <wikibugs>	 (PS4) Kosta Harlan: GrowthExperiments: Enable new impact module on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/862839 (https://phabricator.wikimedia.org/T323526)
[14:37:58] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by kharlan@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/862839 (https://phabricator.wikimedia.org/T323526) (owner: Kosta Harlan)
[14:38:53] <wikibugs>	 (Merged) jenkins-bot: GrowthExperiments: Enable new impact module on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/862839 (https://phabricator.wikimedia.org/T323526) (owner: Kosta Harlan)
[14:39:15] <logmsgbot>	 !log kharlan@deploy1002 Started scap: Backport for [[gerrit:862839|GrowthExperiments: Enable new impact module on testwiki (T323526)]]
[14:39:19] <stashbot>	 T323526: New Impact Module: Start experiment for the new Impact module on Growth Pilot wikis (ar, bn, cs, es) - https://phabricator.wikimedia.org/T323526
[14:39:39] <wikibugs>	 (PS1) Filippo Giunchedi: kubernetes: output envs/services separated by newlines [puppet] - https://gerrit.wikimedia.org/r/862935 (https://phabricator.wikimedia.org/T324091)
[14:40:23] <logmsgbot>	 !log kharlan@deploy1002 kharlan and kharlan: Backport for [[gerrit:862839|GrowthExperiments: Enable new impact module on testwiki (T323526)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet
[14:41:10] <wikibugs>	 (PS1) Filippo Giunchedi: hiera: replace thanos-sso with thanos-web [puppet] - https://gerrit.wikimedia.org/r/862937 (https://phabricator.wikimedia.org/T323913)
[14:41:12] <kostajh>	 syncing
[14:41:49] <wikibugs>	 (CR) CI reject: [V: -1] hiera: replace thanos-sso with thanos-web [puppet] - https://gerrit.wikimedia.org/r/862937 (https://phabricator.wikimedia.org/T323913) (owner: Filippo Giunchedi)
[14:42:06] <XioNoX>	 !log add BGP sessions to RIPE RIS in drmrs
[14:42:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:42:34] <wikibugs>	 (PS2) Filippo Giunchedi: hiera: replace thanos-sso with thanos-web [puppet] - https://gerrit.wikimedia.org/r/862937 (https://phabricator.wikimedia.org/T323913)
[14:42:40] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[14:42:41] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[14:42:48] <wikibugs>	 (PS1) Clément Goubert: kube-env: Add completion for alias [puppet] - https://gerrit.wikimedia.org/r/862938 (https://phabricator.wikimedia.org/T324091)
[14:43:43] <wikibugs>	 (CR) Clément Goubert: [C: +1] kubernetes: output envs/services separated by newlines [puppet] - https://gerrit.wikimedia.org/r/862935 (https://phabricator.wikimedia.org/T324091) (owner: Filippo Giunchedi)
[14:44:16] <wikibugs>	 SRE, ops-eqsin, DC-Ops, Traffic, Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (ssingh)
[14:44:27] <wikibugs>	 (PS1) Filippo Giunchedi: wmnet: remove thanos-sso [dns] - https://gerrit.wikimedia.org/r/862939 (https://phabricator.wikimedia.org/T323913)
[14:44:33] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] kubernetes: output envs/services separated by newlines [puppet] - https://gerrit.wikimedia.org/r/862935 (https://phabricator.wikimedia.org/T324091) (owner: Filippo Giunchedi)
[14:44:55] <godog>	 jbond: merged your changes to labs/private
[14:45:02] * godog high fives claime 
[14:45:17] <jbond>	 godog: thanks
[14:45:28] <logmsgbot>	 !log kharlan@deploy1002 Finished scap: Backport for [[gerrit:862839|GrowthExperiments: Enable new impact module on testwiki (T323526)]] (duration: 06m 12s)
[14:45:31] <stashbot>	 T323526: New Impact Module: Start experiment for the new Impact module on Growth Pilot wikis (ar, bn, cs, es) - https://phabricator.wikimedia.org/T323526
[14:45:44] <wikibugs>	 (PS2) Clément Goubert: kube-env: Add completion for alias [puppet] - https://gerrit.wikimedia.org/r/862938 (https://phabricator.wikimedia.org/T324091)
[14:46:01] <wikibugs>	 (PS4) Ilias Sarantopoulos: ml-services: enable multi-processing for ml-staging revscoring-editquality-goodfaith model [deployment-charts] - https://gerrit.wikimedia.org/r/861345 (https://phabricator.wikimedia.org/T323624)
[14:46:04] * claime high fives godog
[14:46:16] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[14:46:20] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] kube-env: Add completion for alias [puppet] - https://gerrit.wikimedia.org/r/862938 (https://phabricator.wikimedia.org/T324091) (owner: Clément Goubert)
[14:46:44] <kostajh>	 I don't think Sohom_Datta is around, so skipping their patch
[14:46:51] <kostajh>	 unless Lucas_WMDE you think we should do it?
[14:47:01] <Lucas_WMDE>	 I haven’t looked at the patch yet
[14:47:30] <kostajh>	 Lucas_WMDE: it has a +1 from Jdlrobson
[14:48:17] <kostajh>	 hi Sohom_Datta, we were just discussing your patch
[14:48:20] <Lucas_WMDE>	 hi Sohom_Datta!
[14:48:27] <Sohom_Datta>	 Hi
[14:48:28] <wikibugs>	 (CR) Clément Goubert: [C: +2] kube-env: Add completion for alias [puppet] - https://gerrit.wikimedia.org/r/862938 (https://phabricator.wikimedia.org/T324091) (owner: Clément Goubert)
[14:48:53] <wikibugs>	 (CR) Ilias Sarantopoulos: ml-services: enable multi-processing for ml-staging revscoring-editquality-goodfaith model (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/861345 (https://phabricator.wikimedia.org/T323624) (owner: Ilias Sarantopoulos)
[14:49:01] <Lucas_WMDE>	 kostajh: do you want to deploy it or should I?
[14:49:07] <kostajh>	 Lucas_WMDE: are you able to take over?
[14:49:12] <Lucas_WMDE>	 I can, sure
[14:49:16] <wikibugs>	 (PS1) Jbond: idp::standalon: Add OIDC config [puppet] - https://gerrit.wikimedia.org/r/862942 (https://phabricator.wikimedia.org/T311999)
[14:49:23] <kostajh>	 Lucas_WMDE: then, I am off to the kindergarten. thanks!
[14:49:26] <Lucas_WMDE>	 ok!
[14:50:12] <wikibugs>	 (PS2) Lucas Werkmeister (WMDE): Enable limited width on plwikisource MAIN namespace [mediawiki-config] - https://gerrit.wikimedia.org/r/861431 (https://phabricator.wikimedia.org/T323185) (owner: Sohom Datta)
[14:50:16] <wikibugs>	 (PS1) Ssingh: lvs5004: commission new LVS host (eqsin hardware refresh) [puppet] - https://gerrit.wikimedia.org/r/862943 (https://phabricator.wikimedia.org/T322048)
[14:50:24] <moritzm>	 !log installing krb5 security updates
[14:50:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:50:26] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): [C: +1] "diffConfig looks good to me (effectively removes ns0 from the setting)" [mediawiki-config] - https://gerrit.wikimedia.org/r/861431 (https://phabricator.wikimedia.org/T323185) (owner: Sohom Datta)
[14:50:44] <wikibugs>	 (PS34) Andrew Bogott: wmcs: changes to api service to manage toolforge replica.my.cnf [puppet] - https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe)
[14:51:22] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[14:51:22] <wikibugs>	 (PS1) Ssingh: sites.yaml: add lvs5004 (eqsin hardware refresh) [homer/public] - https://gerrit.wikimedia.org/r/862944 (https://phabricator.wikimedia.org/T322048)
[14:51:26] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/861431 (https://phabricator.wikimedia.org/T323185) (owner: Sohom Datta)
[14:52:01] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[14:52:02] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[14:52:17] <wikibugs>	 (Merged) jenkins-bot: Enable limited width on plwikisource MAIN namespace [mediawiki-config] - https://gerrit.wikimedia.org/r/861431 (https://phabricator.wikimedia.org/T323185) (owner: Sohom Datta)
[14:52:39] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[14:52:42] <wikibugs>	 (PS1) Volans: setup.py: temporary upper limit to prospector [cookbooks] - https://gerrit.wikimedia.org/r/862945
[14:52:46] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:861431|Enable limited width on plwikisource MAIN namespace (T323185)]]
[14:52:49] <stashbot>	 T323185: Enabled limited width preference disables limited width in Wikisource main namespace - https://phabricator.wikimedia.org/T323185
[14:53:12] <wikibugs>	 (PS1) Ssingh: lvs5001: set profile::pybal::bgp to no [puppet] - https://gerrit.wikimedia.org/r/862946 (https://phabricator.wikimedia.org/T323830)
[14:53:38] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315 (T323907)', diff saved to https://phabricator.wikimedia.org/P42177 and previous config saved to /var/cache/conftool/dbconfig/20221201-145337-ladsgroup.json
[14:53:40] <stashbot>	 T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907
[14:53:53] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde and soda: Backport for [[gerrit:861431|Enable limited width on plwikisource MAIN namespace (T323185)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet
[14:53:59] <wikibugs>	 (PS2) Ssingh: lvs5001: set profile::pybal::bgp to no [puppet] - https://gerrit.wikimedia.org/r/862946 (https://phabricator.wikimedia.org/T323830)
[14:54:31] <Lucas_WMDE>	 Sohom_Datta: the change should be on mwdebug, can you test it?
[14:54:35] <wikibugs>	 (CR) Volans: [C: +2] "unblocking CI" [cookbooks] - https://gerrit.wikimedia.org/r/862945 (owner: Volans)
[14:54:45] <Sohom_Datta>	 yeah, checking :)
[14:54:49] <wikibugs>	 (PS5) Ilias Sarantopoulos: ml-services: enable multi-processing for ml-staging revscoring-editquality-goodfaith model [deployment-charts] - https://gerrit.wikimedia.org/r/861345 (https://phabricator.wikimedia.org/T323624)
[14:54:52] <Lucas_WMDE>	 ok :)
[14:54:53] <wikibugs>	 (CR) Ssingh: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38549/console" [puppet] - https://gerrit.wikimedia.org/r/862946 (https://phabricator.wikimedia.org/T323830) (owner: Ssingh)
[14:55:45] <Sohom_Datta>	 Yep works fine :)
[14:56:16] <wikibugs>	 (Merged) jenkins-bot: setup.py: temporary upper limit to prospector [cookbooks] - https://gerrit.wikimedia.org/r/862945 (owner: Volans)
[14:56:19] <Lucas_WMDE>	 ok \o/
[14:56:28] <wikibugs>	 (PS8) Volans: sre.switchdc.mediawiki: adapt to a/a mediawiki [cookbooks] - https://gerrit.wikimedia.org/r/836729 (owner: Giuseppe Lavagetto)
[14:57:43] <wikibugs>	 SRE-OnFire, Gerrit, serviceops-collab, Release-Engineering-Team (GitLab III: GitLab in LA 🪃), and 2 others: gerrit1001 running out of space on / - https://phabricator.wikimedia.org/T323262 (hashar) Sorry for lack of update. I did dig into the Gerrit cache which are backed up by H2 Database. Some...
[14:57:43] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[14:58:08] <wikibugs>	 (PS2) Jbond: idp::standalon: Add OIDC config [puppet] - https://gerrit.wikimedia.org/r/862942 (https://phabricator.wikimedia.org/T311999)
[14:58:56] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[14:58:57] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[14:59:15] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 25): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38548/console" [puppet] - https://gerrit.wikimedia.org/r/862942 (https://phabricator.wikimedia.org/T311999) (owner: Jbond)
[14:59:24] <wikibugs>	 (CR) Elukey: ml-services: enable multi-processing for ml-staging revscoring-editquality-goodfaith model (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/861345 (https://phabricator.wikimedia.org/T323624) (owner: Ilias Sarantopoulos)
[14:59:56] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[15:00:25] <wikibugs>	 (CR) Ssingh: [C: +2] sites.yaml: remove dns5001 from anycast_neighbors [homer/public] - https://gerrit.wikimedia.org/r/862321 (https://phabricator.wikimedia.org/T323830) (owner: Ssingh)
[15:00:53] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:861431|Enable limited width on plwikisource MAIN namespace (T323185)]] (duration: 08m 06s)
[15:00:56] <stashbot>	 T323185: Enabled limited width preference disables limited width in Wikisource main namespace - https://phabricator.wikimedia.org/T323185
[15:01:03] <wikibugs>	 (Merged) jenkins-bot: sites.yaml: remove dns5001 from anycast_neighbors [homer/public] - https://gerrit.wikimedia.org/r/862321 (https://phabricator.wikimedia.org/T323830) (owner: Ssingh)
[15:01:30] <Lucas_WMDE>	 !log UTC afternoon backport+config window done
[15:01:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:01:40] <Lucas_WMDE>	 only a minute over time ;)
[15:01:41] <Lucas_WMDE>	 jouncebot: now
[15:01:41] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 58 minute(s)
[15:01:45] <Lucas_WMDE>	 ok :)
[15:02:36] <wikibugs>	 (CR) Jbond: [C: +2] idp::standalon: Add OIDC config [puppet] - https://gerrit.wikimedia.org/r/862942 (https://phabricator.wikimedia.org/T311999) (owner: Jbond)
[15:02:46] <wikibugs>	 SRE, Cassandra, RESTBase-Cassandra: setup an alertable threshold for Cassandra heap dumps - https://phabricator.wikimedia.org/T106346 (LSobanski)
[15:07:23] <wikibugs>	 (CR) Elukey: ml-services: enable multi-processing for ml-staging revscoring-editquality-goodfaith model (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/861345 (https://phabricator.wikimedia.org/T323624) (owner: Ilias Sarantopoulos)
[15:07:52] <wikibugs>	 (CR) David Caro: wmcs: changes to api service to manage toolforge replica.my.cnf (2 comments) [puppet] - https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe)
[15:08:43] <wikibugs>	 (CR) Vgutierrez: [C: +1] lvs5004: commission new LVS host (eqsin hardware refresh) [puppet] - https://gerrit.wikimedia.org/r/862943 (https://phabricator.wikimedia.org/T322048) (owner: Ssingh)
[15:08:44] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315', diff saved to https://phabricator.wikimedia.org/P42178 and previous config saved to /var/cache/conftool/dbconfig/20221201-150843-ladsgroup.json
[15:09:22] <wikibugs>	 (CR) Vgutierrez: [C: +1] lvs5001: set profile::pybal::bgp to no [puppet] - https://gerrit.wikimedia.org/r/862946 (https://phabricator.wikimedia.org/T323830) (owner: Ssingh)
[15:10:06] <wikibugs>	 (PS6) Ilias Sarantopoulos: ml-services: enable multi-processing for ml-staging revscoring-editquality-goodfaith model [deployment-charts] - https://gerrit.wikimedia.org/r/861345 (https://phabricator.wikimedia.org/T323624)
[15:10:34] <sukhe>	 !log homer "cr*-eqsin*" commit "running homer for Gerrit: 862321"
[15:10:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:10:41] <wikibugs>	 (PS1) Jbond: idp::standalon: use production value for oidc_endpoint [puppet] - https://gerrit.wikimedia.org/r/862948
[15:11:56] <sukhe>	 !log [done] homer "cr*-eqsin*" commit "running homer for Gerrit: 862321"
[15:11:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:12:26] <effie>	 !log php7.4 upgrade + apache upgrade + rolling restarts of app servers - T323358
[15:12:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:13:31] <wikibugs>	 SRE-OnFire, Gerrit, serviceops-collab, Release-Engineering-Team (GitLab III: GitLab in LA 🪃), and 2 others: gerrit1001 running out of space on / - https://phabricator.wikimedia.org/T323262 (hashar) @Jelto thank you for the excellent analysis about monitoring. I will look at integrating that to th...
[15:13:51] <wikibugs>	 (CR) Jbond: [C: +2] idp::standalon: use production value for oidc_endpoint [puppet] - https://gerrit.wikimedia.org/r/862948 (owner: Jbond)
[15:19:20] <icinga-wm>	 RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:23:51] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315', diff saved to https://phabricator.wikimedia.org/P42179 and previous config saved to /var/cache/conftool/dbconfig/20221201-152350-ladsgroup.json
[15:24:07] <wikibugs>	 (PS7) Ilias Sarantopoulos: ml-services: enable multi-processing for ml-staging revscoring-editquality-goodfaith model [deployment-charts] - https://gerrit.wikimedia.org/r/861345 (https://phabricator.wikimedia.org/T323624)
[15:25:10] <wikibugs>	 (PS1) Jbond: idp::standalone: correct name of local_settings [puppet] - https://gerrit.wikimedia.org/r/862950
[15:25:26] <wikibugs>	 (CR) Jbond: [V: +2 C: +2] idp::standalone: correct name of local_settings [puppet] - https://gerrit.wikimedia.org/r/862950 (owner: Jbond)
[15:26:45] <wikibugs>	 (CR) Ssingh: [C: +2] hiera: decommission dns5001 [puppet] - https://gerrit.wikimedia.org/r/862316 (https://phabricator.wikimedia.org/T323830) (owner: Ssingh)
[15:26:55] <wikibugs>	 (PS2) Ssingh: hiera: decommission dns5001 [puppet] - https://gerrit.wikimedia.org/r/862316 (https://phabricator.wikimedia.org/T323830)
[15:27:04] <wikibugs>	 (CR) Elukey: ml-services: enable multi-processing for ml-staging revscoring-editquality-goodfaith model (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/861345 (https://phabricator.wikimedia.org/T323624) (owner: Ilias Sarantopoulos)
[15:28:51] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.decommission for hosts dns5001.wikimedia.org
[15:30:53] <wikibugs>	 (PS8) Ilias Sarantopoulos: ml-services: enable multi-processing for ml-staging revscoring-editquality-goodfaith model [deployment-charts] - https://gerrit.wikimedia.org/r/861345 (https://phabricator.wikimedia.org/T323624)
[15:31:00] <wikibugs>	 SRE, ops-codfw, decommission-hardware, User-fgiunchedi: decommission graphite2003.codfw.wmnet - https://phabricator.wikimedia.org/T323718 (Papaul) Open→Resolved a:Papaul Complete
[15:31:19] <wikibugs>	 (CR) Jdrewniak: [C: +1] mediawiki: Extend /portals max-age from 24h to 1 year [puppet] - https://gerrit.wikimedia.org/r/817409 (https://phabricator.wikimedia.org/T313881) (owner: Krinkle)
[15:31:23] <wikibugs>	 (PS3) Marostegui: Add mlitn to analytics-platform-eng-admins [puppet] - https://gerrit.wikimedia.org/r/862245 (https://phabricator.wikimedia.org/T324101) (owner: Matthias Mullie)
[15:34:06] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.dns.netbox
[15:34:26] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Request for access to analytics-platform-eng-admins for mlitn - https://phabricator.wikimedia.org/T324101 (Marostegui) @matthiasmullie do you want to also add yourself to `analytics-privatedata-users` in the gerrit patch? Once done I can merge and add the k...
[15:35:11] <wikibugs>	 (CR) Marostegui: [C: +2] mariadb: grant user 'phstats' additional select on phabricator_search db [puppet] - https://gerrit.wikimedia.org/r/862895 (https://phabricator.wikimedia.org/T324205) (owner: Aklapper)
[15:35:45] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job haproxy in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:36:26] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: dns5001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin2002"
[15:36:31] <wikibugs>	 SRE, SRE-Access-Requests, DBA, Patch-For-Review: mariadb: grant user 'phstats' additional select on phabricator_search db - https://phabricator.wikimedia.org/T324205 (Marostegui) Open→Resolved a:Marostegui Merged and applied the grants. Please test it and reopen if it is not working!...
[15:37:14] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1054']
[15:38:12] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: dns5001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin2002"
[15:38:12] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:38:13] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts dns5001.wikimedia.org
[15:38:21] <wikibugs>	 SRE, ops-eqsin, DC-Ops, Traffic, Patch-For-Review: Q2:rack/setup/install/decom eqsin: unified decommission task - https://phabricator.wikimedia.org/T323830 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by sukhe@cumin2002 for hosts: `dns5001.wikimedia.org` - dns5001.wikimedia....
[15:38:23] <wikibugs>	 (PS1) Jbond: idp::standalone: config is not a hash [puppet] - https://gerrit.wikimedia.org/r/862954
[15:38:57] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315 (T323907)', diff saved to https://phabricator.wikimedia.org/P42180 and previous config saved to /var/cache/conftool/dbconfig/20221201-153856-ladsgroup.json
[15:38:59] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2178.codfw.wmnet with reason: Maintenance
[15:39:00] <stashbot>	 T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907
[15:39:12] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2178.codfw.wmnet with reason: Maintenance
[15:39:19] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2178 (T323907)', diff saved to https://phabricator.wikimedia.org/P42181 and previous config saved to /var/cache/conftool/dbconfig/20221201-153918-ladsgroup.json
[15:39:21] <wikibugs>	 SRE, ops-eqsin, DC-Ops, Traffic, Patch-For-Review: Q2:rack/setup/install/decom eqsin: unified decommission task - https://phabricator.wikimedia.org/T323830 (ssingh)
[15:40:45] <jinxer-wm>	 (JobUnavailable) resolved: (2) Reduced availability for job haproxy in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:41:11] <wikibugs>	 (CR) Jbond: [C: +2] idp::standalone: config is not a hash [puppet] - https://gerrit.wikimedia.org/r/862954 (owner: Jbond)
[15:41:39] <effie>	 !log php7.4 upgrade + apache upgrade + rolling restarts of api servers - T323358
[15:41:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:43:25] <wikibugs>	 (CR) Vgutierrez: [C: +1] P:cache::haproxy: harden systemd unit [puppet] - https://gerrit.wikimedia.org/r/861445 (https://phabricator.wikimedia.org/T323944) (owner: Ssingh)
[15:45:44] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1055']
[15:46:25] <wikibugs>	 (PS1) Jbond: idp::standlone: corrcet file name [puppet] - https://gerrit.wikimedia.org/r/862957
[15:46:56] <wikibugs>	 (CR) Elukey: ml-services: enable multi-processing for ml-staging revscoring-editquality-goodfaith model (3 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/861345 (https://phabricator.wikimedia.org/T323624) (owner: Ilias Sarantopoulos)
[15:47:24] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudvirt1054']
[15:48:09] <wikibugs>	 (CR) Jbond: [V: +2 C: +2] idp::standlone: corrcet file name [puppet] - https://gerrit.wikimedia.org/r/862957 (owner: Jbond)
[15:48:22] <wikibugs>	 (PS9) Ilias Sarantopoulos: ml-services: enable multi-processing for ml-staging revscoring-editquality-goodfaith model [deployment-charts] - https://gerrit.wikimedia.org/r/861345 (https://phabricator.wikimedia.org/T323624)
[15:50:31] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1054']
[15:51:26] <wikibugs>	 (CR) Elukey: ml-services: enable multi-processing for ml-staging revscoring-editquality-goodfaith model (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/861345 (https://phabricator.wikimedia.org/T323624) (owner: Ilias Sarantopoulos)
[15:52:09] <wikibugs>	 (CR) Elukey: ml-services: enable multi-processing for ml-staging revscoring-editquality-goodfaith model (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/861345 (https://phabricator.wikimedia.org/T323624) (owner: Ilias Sarantopoulos)
[15:52:38] <wikibugs>	 (CR) Ilias Sarantopoulos: ml-services: enable multi-processing for ml-staging revscoring-editquality-goodfaith model (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/861345 (https://phabricator.wikimedia.org/T323624) (owner: Ilias Sarantopoulos)
[15:53:19] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Request for access to analytics-platform-eng-admins for mlitn - https://phabricator.wikimedia.org/T324101 (matthiasmullie) I am already part of `analytics-privatedata-users` (see https://gerrit.wikimedia.org/r/c/operations/puppet/+/862245/3/modules/admin/da...
[15:53:39] <wikibugs>	 (PS10) Ilias Sarantopoulos: ml-services: enable multi-processing for ml-staging model servers [deployment-charts] - https://gerrit.wikimedia.org/r/861345 (https://phabricator.wikimedia.org/T323624)
[15:54:49] <wikibugs>	 (PS1) Papaul: Add new cloudvirt node to site.pp [puppet] - https://gerrit.wikimedia.org/r/862958 (https://phabricator.wikimedia.org/T313983)
[15:55:34] <wikibugs>	 (CR) Papaul: [C: +2] Add new cloudvirt node to site.pp [puppet] - https://gerrit.wikimedia.org/r/862958 (https://phabricator.wikimedia.org/T313983) (owner: Papaul)
[15:56:13] <wikibugs>	 (CR) Marostegui: [C: +2] Add mlitn to analytics-platform-eng-admins [puppet] - https://gerrit.wikimedia.org/r/862245 (https://phabricator.wikimedia.org/T324101) (owner: Matthias Mullie)
[15:57:16] <effie>	 !log php7.4 upgrade + apache upgrade + rolling restarts of jobrunners/videoscalers servers - T323358
[15:57:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:57:41] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Request for access to analytics-platform-eng-admins for mlitn - https://phabricator.wikimedia.org/T324101 (Marostegui)
[15:57:53] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1056']
[15:58:05] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudvirt1055']
[15:58:16] <wikibugs>	 (PS11) Ilias Sarantopoulos: ml-services: enable multi-processing for ml-staging model servers [deployment-charts] - https://gerrit.wikimedia.org/r/861345 (https://phabricator.wikimedia.org/T323624)
[15:59:18] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T323907)', diff saved to https://phabricator.wikimedia.org/P42182 and previous config saved to /var/cache/conftool/dbconfig/20221201-155917-ladsgroup.json
[15:59:22] <stashbot>	 T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907
[15:59:53] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Request for access to analytics-platform-eng-admins for mlitn - https://phabricator.wikimedia.org/T324101 (Marostegui) Open→Resolved a:Marostegui I have merged your patch. Also, you should've gotten an email about your kerberos principal. Please...
[16:00:29] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1055']
[16:00:53] <effie>	 !log php7.4 upgrade + apache upgrade + rolling restarts of parsoid servers - T323358
[16:00:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:06:23] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudvirt1054']
[16:07:23] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirt1054.eqiad.wmnet with OS bullseye
[16:07:32] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Patch-For-Review, cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloudvirt1054.eqiad.wmn...
[16:08:17] <wikibugs>	 (PS12) Ilias Sarantopoulos: ml-services: enable multi-processing for ml-staging model servers [deployment-charts] - https://gerrit.wikimedia.org/r/861345 (https://phabricator.wikimedia.org/T323624)
[16:09:10] <wikibugs>	 (CR) Elukey: ml-services: enable multi-processing for ml-staging model servers (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/861345 (https://phabricator.wikimedia.org/T323624) (owner: Ilias Sarantopoulos)
[16:10:28] <wikibugs>	 (PS7) Elukey: WIP - Upgrade knative to 1.7.2 [deployment-charts] - https://gerrit.wikimedia.org/r/861395 (https://phabricator.wikimedia.org/T323793)
[16:12:11] <wikibugs>	 (CR) CI reject: [V: -1] WIP - Upgrade knative to 1.7.2 [deployment-charts] - https://gerrit.wikimedia.org/r/861395 (https://phabricator.wikimedia.org/T323793) (owner: Elukey)
[16:13:07] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudvirt1056']
[16:13:32] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudvirt1055']
[16:14:24] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P42183 and previous config saved to /var/cache/conftool/dbconfig/20221201-161424-ladsgroup.json
[16:15:20] <wikibugs>	 (CR) Elukey: [C: +2] ml-services: enable multi-processing for ml-staging model servers [deployment-charts] - https://gerrit.wikimedia.org/r/861345 (https://phabricator.wikimedia.org/T323624) (owner: Ilias Sarantopoulos)
[16:17:26] <wikibugs>	 (PS1) Jbond: idp::standalone: correct permissions [puppet] - https://gerrit.wikimedia.org/r/862967
[16:19:28] <wikibugs>	 (CR) Papaul: [C: +2] Add partman rules for cloudvirt10[54-61] [puppet] - https://gerrit.wikimedia.org/r/862406 (https://phabricator.wikimedia.org/T313983) (owner: Andrew Bogott)
[16:19:42] <wikibugs>	 (PS2) Papaul: Add partman rules for cloudvirt10[54-61] [puppet] - https://gerrit.wikimedia.org/r/862406 (https://phabricator.wikimedia.org/T313983) (owner: Andrew Bogott)
[16:20:45] <wikibugs>	 (CR) Jbond: [C: +2] idp::standalone: correct permissions [puppet] - https://gerrit.wikimedia.org/r/862967 (owner: Jbond)
[16:21:21] <wikibugs>	 SRE, Release-Engineering-Team, serviceops-collab: Redirect revisions from svn.wikimedia.org to https://static-codereview.wikimedia.org - https://phabricator.wikimedia.org/T119846 (Krinkle)
[16:22:18] <wikibugs>	 SRE, Release-Engineering-Team, serviceops-collab: Redirect revisions from svn.wikimedia.org to https://static-codereview.wikimedia.org - https://phabricator.wikimedia.org/T119846 (Krinkle) I've boldly updated the task description to suggest targetting <https://static-codereview.wikimedia.org> instead...
[16:25:08] <wikibugs>	 (PS8) Elukey: WIP - Upgrade knative to 1.7.2 [deployment-charts] - https://gerrit.wikimedia.org/r/861395 (https://phabricator.wikimedia.org/T323793)
[16:26:04] <wikibugs>	 (CR) CI reject: [V: -1] WIP - Upgrade knative to 1.7.2 [deployment-charts] - https://gerrit.wikimedia.org/r/861395 (https://phabricator.wikimedia.org/T323793) (owner: Elukey)
[16:26:29] <wikibugs>	 (PS9) Elukey: WIP - Upgrade knative to 1.7.2 [deployment-charts] - https://gerrit.wikimedia.org/r/861395 (https://phabricator.wikimedia.org/T323793)
[16:26:40] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1056']
[16:28:16] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T323907)', diff saved to https://phabricator.wikimedia.org/P42184 and previous config saved to /var/cache/conftool/dbconfig/20221201-162815-ladsgroup.json
[16:28:19] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirt1055.eqiad.wmnet with OS bullseye
[16:28:19] <stashbot>	 T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907
[16:28:29] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Patch-For-Review, cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloudvirt1055.eqiad.wmn...
[16:29:31] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P42185 and previous config saved to /var/cache/conftool/dbconfig/20221201-162930-ladsgroup.json
[16:30:06] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Patch-For-Review, cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (Papaul)
[16:31:40] <wikibugs>	 (PS10) Elukey: WIP - Upgrade knative to 1.7.2 [deployment-charts] - https://gerrit.wikimedia.org/r/861395 (https://phabricator.wikimedia.org/T323793)
[16:32:37] <wikibugs>	 SRE, ops-eqiad, DC-Ops: Q4:(Need By: TBD) rack/setup/install kafka-jumbo101[0-5] - https://phabricator.wikimedia.org/T306939 (Papaul) a:Papaul→None
[16:34:08] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1057']
[16:36:03] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1054.eqiad.wmnet with reason: host reimage
[16:36:30] <wikibugs>	 (CR) Andrea Denisse: [C: +1] "LGTM, thank you!!" [puppet] - https://gerrit.wikimedia.org/r/862937 (https://phabricator.wikimedia.org/T323913) (owner: Filippo Giunchedi)
[16:37:46] <icinga-wm>	 RECOVERY - mediawiki-installation DSH group on mw1307 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[16:37:56] <wikibugs>	 SRE, SRE-Access-Requests, DBA: mariadb: grant user 'phstats' additional select on phabricator_search db - https://phabricator.wikimedia.org/T324205 (Aklapper) Works. Thank you!
[16:38:59] <wikibugs>	 (PS11) Elukey: WIP - Upgrade knative to 1.7.2 [deployment-charts] - https://gerrit.wikimedia.org/r/861395 (https://phabricator.wikimedia.org/T323793)
[16:39:10] <wikibugs>	 (CR) JMeybohm: [V: +2 C: +2] Rewrite as kubernetes operator/controller [software/helm-state-metrics] - https://gerrit.wikimedia.org/r/861352 (https://phabricator.wikimedia.org/T323706) (owner: JMeybohm)
[16:39:19] <wikibugs>	 (CR) JMeybohm: [V: +2 C: +2] update vendor [software/helm-state-metrics] - https://gerrit.wikimedia.org/r/861353 (https://phabricator.wikimedia.org/T323706) (owner: JMeybohm)
[16:39:34] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1054.eqiad.wmnet with reason: host reimage
[16:40:45] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1055.eqiad.wmnet with reason: host reimage
[16:41:48] <wikibugs>	 (CR) JMeybohm: [V: +2 C: +2] helm-state-metrics: Update to v0.2.0 [docker-images/production-images] - https://gerrit.wikimedia.org/r/862173 (https://phabricator.wikimedia.org/T323706) (owner: JMeybohm)
[16:42:37] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudvirt1056']
[16:43:22] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P42187 and previous config saved to /var/cache/conftool/dbconfig/20221201-164322-ladsgroup.json
[16:43:35] <moritzm>	 !log installing ini4j security updates
[16:43:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:44:09] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1055.eqiad.wmnet with reason: host reimage
[16:44:32] <wikibugs>	 (PS12) Elukey: WIP - Upgrade knative to 1.7.2 [deployment-charts] - https://gerrit.wikimedia.org/r/861395 (https://phabricator.wikimedia.org/T323793)
[16:44:38] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T323907)', diff saved to https://phabricator.wikimedia.org/P42188 and previous config saved to /var/cache/conftool/dbconfig/20221201-164437-ladsgroup.json
[16:44:39] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1185.eqiad.wmnet with reason: Maintenance
[16:44:44] <wikibugs>	 (CR) JMeybohm: [C: +2] helm-state-metrics: Update to v0.2.0 [deployment-charts] - https://gerrit.wikimedia.org/r/862174 (https://phabricator.wikimedia.org/T323706) (owner: JMeybohm)
[16:44:45] <stashbot>	 T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907
[16:45:03] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1185.eqiad.wmnet with reason: Maintenance
[16:45:10] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1185 (T323907)', diff saved to https://phabricator.wikimedia.org/P42189 and previous config saved to /var/cache/conftool/dbconfig/20221201-164509-ladsgroup.json
[16:45:30] <wikibugs>	 (CR) Ssingh: [V: +1] "Thank you both for the review." [puppet] - https://gerrit.wikimedia.org/r/861445 (https://phabricator.wikimedia.org/T323944) (owner: Ssingh)
[16:45:44] <wikibugs>	 (CR) Ssingh: [V: +1 C: +2] P:cache::haproxy: harden systemd unit [puppet] - https://gerrit.wikimedia.org/r/861445 (https://phabricator.wikimedia.org/T323944) (owner: Ssingh)
[16:46:07] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (aborrero) note, because {T319184} these hosts only use 1 single network interface. It should be the default in puppet.  We need a particular sw...
[16:46:41] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.dns.netbox
[16:47:42] <icinga-wm>	 PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) is CRITICAL: Test retrieve featured image data for April 29, 2016 returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/news (get In the News content) is CRITICAL: Test get In the News content returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds
[16:48:37] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: dns5004 fix - robh@cumin2002"
[16:48:44] <icinga-wm>	 RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[16:49:14] <wikibugs>	 (Merged) jenkins-bot: helm-state-metrics: Update to v0.2.0 [deployment-charts] - https://gerrit.wikimedia.org/r/862174 (https://phabricator.wikimedia.org/T323706) (owner: JMeybohm)
[16:49:42] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: dns5004 fix - robh@cumin2002"
[16:49:42] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:50:24] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudvirt1057']
[16:50:31] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host dns5004
[16:50:56] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host dns5004
[16:51:24] <wikibugs>	 Puppet, SRE, Infrastructure-Foundations: Ensure that there are no firewall rules in modules - https://phabricator.wikimedia.org/T114209 (MoritzMuehlenhoff)
[16:52:26] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[16:53:39] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1054.eqiad.wmnet with OS bullseye
[16:53:47] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cloudvirt1054.eqiad.wmnet with OS bullseye comple...
[16:55:59] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[16:56:38] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[16:57:22] <wikibugs>	 (PS1) Vivian Rook: k8s [puppet] - https://gerrit.wikimedia.org/r/862992
[16:58:29] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P42190 and previous config saved to /var/cache/conftool/dbconfig/20221201-165828-ladsgroup.json
[16:58:42] <wikibugs>	 (Abandoned) Vivian Rook: k8s [puppet] - https://gerrit.wikimedia.org/r/862992 (owner: Vivian Rook)
[16:58:57] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1055.eqiad.wmnet with OS bullseye
[16:59:03] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cloudvirt1055.eqiad.wmnet with OS bullseye comple...
[16:59:29] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1057']
[17:00:05] <jouncebot>	 jbond and rzl: Time to snap out of that daydream and deploy Puppet request window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221201T1700).
[17:00:05] <jouncebot>	 tgr: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[17:00:35] <rzl>	 tgr_: hi, looking
[17:01:08] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[17:01:29] * jbond steps away unless needed
[17:01:54] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirt1056.eqiad.wmnet with OS bullseye
[17:02:01] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloudvirt1056.eqiad.wmnet with OS bullseye
[17:02:52] <rzl>	 tgr_: looks straightforward, running pcc as a formality then I'll merge :) will you want me to kick off a test run of any of these?
[17:02:54] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[17:03:53] <tgr_>	 rzl: no, thanks, we'll do a test run later today when wmf.12 is live
[17:03:56] <rzl>	 👍
[17:04:06] <wikibugs>	 (CR) RLazarus: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38550/console" [puppet] - https://gerrit.wikimedia.org/r/861964 (https://phabricator.wikimedia.org/T323958) (owner: Gergő Tisza)
[17:04:39] <wikibugs>	 (CR) RLazarus: [V: +1 C: +2] growthexperiments: Use min edit limit for user impact refresh [puppet] - https://gerrit.wikimedia.org/r/861964 (https://phabricator.wikimedia.org/T323958) (owner: Gergő Tisza)
[17:05:23] <wikibugs>	 (PS1) Vivian Rook: aptrepo: add thirdparty/kubeadm-k8s-1-2[34] [puppet] - https://gerrit.wikimedia.org/r/862994
[17:07:21] <wikibugs>	 (PS1) Ssingh: dns5004: add Puppet role and DNS/NTP configs [puppet] - https://gerrit.wikimedia.org/r/862996 (https://phabricator.wikimedia.org/T322048)
[17:07:39] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1058']
[17:08:11] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1059']
[17:08:13] <rzl>	 all set
[17:08:52] <wikibugs>	 (PS1) Ssingh: sites.yaml: add dns5004 (eqsin hardware refresh) [homer/public] - https://gerrit.wikimedia.org/r/862998 (https://phabricator.wikimedia.org/T322048)
[17:11:28] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 199 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[17:13:08] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 4 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[17:13:28] <wikibugs>	 (CR) Ssingh: [C: +2] dns5004: add Puppet role and DNS/NTP configs [puppet] - https://gerrit.wikimedia.org/r/862996 (https://phabricator.wikimedia.org/T322048) (owner: Ssingh)
[17:13:36] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T323907)', diff saved to https://phabricator.wikimedia.org/P42191 and previous config saved to /var/cache/conftool/dbconfig/20221201-171335-ladsgroup.json
[17:13:40] <stashbot>	 T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907
[17:14:23] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1056.eqiad.wmnet with reason: host reimage
[17:14:55] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host dns5004.wikimedia.org with OS buster
[17:15:05] <wikibugs>	 SRE, ops-eqsin, DC-Ops, Traffic, Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host dns5004.wikimedia.org with OS buster
[17:16:40] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (Papaul)
[17:18:13] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1056.eqiad.wmnet with reason: host reimage
[17:21:47] <wikibugs>	 (PS1) Volans: setup.py: update dependencies and metadata [software/spicerack] - https://gerrit.wikimedia.org/r/863003
[17:21:49] <wikibugs>	 (PS1) Volans: spicerack: add module injection support [software/spicerack] - https://gerrit.wikimedia.org/r/863004 (https://phabricator.wikimedia.org/T319401)
[17:22:33] <wikibugs>	 (PS1) Jbond: hiera: add oidc endpoint to apero_cas global [puppet] - https://gerrit.wikimedia.org/r/863005
[17:22:35] <wikibugs>	 (PS1) Jbond: apereo_cas: add OidcRegisteredService service support [puppet] - https://gerrit.wikimedia.org/r/863006
[17:23:38] <wikibugs>	 (PS2) Jbond: hiera: add oidc endpoint to apero_cas global [puppet] - https://gerrit.wikimedia.org/r/863005
[17:23:57] <wikibugs>	 (CR) David Caro: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/863005 (owner: Jbond)
[17:24:03] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudvirt1059']
[17:24:32] <wikibugs>	 (CR) Jbond: [C: +2] hiera: add oidc endpoint to apero_cas global [puppet] - https://gerrit.wikimedia.org/r/863005 (owner: Jbond)
[17:25:04] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudvirt1058']
[17:26:33] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1058']
[17:26:35] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T323907)', diff saved to https://phabricator.wikimedia.org/P42192 and previous config saved to /var/cache/conftool/dbconfig/20221201-172634-ladsgroup.json
[17:26:37] <stashbot>	 T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907
[17:27:09] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1059']
[17:29:12] <wikibugs>	 (PS1) Sharvaniharan: Merge branch 'master' into HEAD [mediawiki-config] - https://gerrit.wikimedia.org/r/863008
[17:29:24] <wikibugs>	 (CR) David Caro: quota_increase: Fix issue with dashed quota names (1 comment) [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/862294 (owner: David Caro)
[17:29:30] <wikibugs>	 (CR) CI reject: [V: -1] Merge branch 'master' into HEAD [mediawiki-config] - https://gerrit.wikimedia.org/r/863008 (owner: Sharvaniharan)
[17:29:32] <wikibugs>	 (Abandoned) Sharvaniharan: Merge branch 'master' into HEAD [mediawiki-config] - https://gerrit.wikimedia.org/r/863008 (owner: Sharvaniharan)
[17:29:40] <wikibugs>	 (PS1) Papaul: Add new sretest codfw node to site.pp and netboot.cfg [puppet] - https://gerrit.wikimedia.org/r/863009 (https://phabricator.wikimedia.org/T322578)
[17:30:25] <wikibugs>	 (PS3) David Caro: quota_increase: Fix issue with dashed quota names [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/862294
[17:31:29] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudvirt1057']
[17:32:29] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1056.eqiad.wmnet with OS bullseye
[17:32:36] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cloudvirt1056.eqiad.wmnet with OS bullseye comple...
[17:32:44] <wikibugs>	 (CR) David Caro: Revert "setup.py: add temporary upper limit for pylint" (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/862364 (owner: Volans)
[17:33:25] <wikibugs>	 (CR) Volans: "This is my proposal for the external modules injection and the possibility to inject additional accessors into the Spicerack instance." [software/spicerack] - https://gerrit.wikimedia.org/r/863004 (https://phabricator.wikimedia.org/T319401) (owner: Volans)
[17:33:39] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1057']
[17:33:40] <wikibugs>	 (CR) CI reject: [V: -1] quota_increase: Fix issue with dashed quota names [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/862294 (owner: David Caro)
[17:33:52] <wikibugs>	 (CR) Ayounsi: [C: +1] sites.yaml: add dns5004 (eqsin hardware refresh) [homer/public] - https://gerrit.wikimedia.org/r/862998 (https://phabricator.wikimedia.org/T322048) (owner: Ssingh)
[17:34:24] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1060']
[17:36:38] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudvirt1057']
[17:37:17] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (Papaul)
[17:38:15] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirt1057.eqiad.wmnet with OS bullseye
[17:38:22] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloudvirt1057.eqiad.wmnet with OS bullseye
[17:40:32] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudvirt1059']
[17:40:59] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudvirt1058']
[17:41:41] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P42193 and previous config saved to /var/cache/conftool/dbconfig/20221201-174140-ladsgroup.json
[17:42:13] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to Turnilo for USER:wfan - https://phabricator.wikimedia.org/T324057 (AnnWF)
[17:42:15] <wikibugs>	 (PS1) Sharvaniharan: New configs for android schemas [mediawiki-config] - https://gerrit.wikimedia.org/r/863011
[17:42:36] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirt1058.eqiad.wmnet with OS bullseye
[17:42:42] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloudvirt1058.eqiad.wmnet with OS bullseye
[17:43:57] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to Turnilo for USER:wfan - https://phabricator.wikimedia.org/T324057 (AnnWF)
[17:44:38] <wikibugs>	 (CR) Sharvaniharan: "Hi @Ottomata @Bearloga" [mediawiki-config] - https://gerrit.wikimedia.org/r/863011 (owner: Sharvaniharan)
[17:44:38] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirt1059.eqiad.wmnet with OS bullseye
[17:44:54] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloudvirt1059.eqiad.wmnet with OS bullseye
[17:44:58] <wikibugs>	 (CR) BCornwall: varnish: Remove unused dstat plugins (3 comments) [puppet] - https://gerrit.wikimedia.org/r/862371 (owner: BCornwall)
[17:45:56] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudvirt1060']
[17:46:19] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1060']
[17:47:06] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudvirt1060']
[17:47:30] <wikibugs>	 (CR) CI reject: [V: -1] setup.py: update dependencies and metadata [software/spicerack] - https://gerrit.wikimedia.org/r/863003 (owner: Volans)
[17:47:33] <wikibugs>	 (CR) CI reject: [V: -1] spicerack: add module injection support [software/spicerack] - https://gerrit.wikimedia.org/r/863004 (https://phabricator.wikimedia.org/T319401) (owner: Volans)
[17:47:33] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on dns5004.wikimedia.org with reason: host reimage
[17:47:41] <wikibugs>	 SRE, Traffic, serviceops, Platform Team Initiatives (API Gateway): Handle edge cache invalidation for the api gateway - https://phabricator.wikimedia.org/T324200 (daniel) Note that we only need active purging if/when we emit cache control headers that tell the edge cache to cache long-term.  One k...
[17:50:10] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1060']
[17:50:56] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudvirt1060']
[17:51:24] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dns5004.wikimedia.org with reason: host reimage
[17:51:25] <wikibugs>	 (PS2) BCornwall: varnish: Remove unused dstat plugins [puppet] - https://gerrit.wikimedia.org/r/862371
[17:53:48] <wikibugs>	 (CR) BCornwall: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38551/console" [puppet] - https://gerrit.wikimedia.org/r/862371 (owner: BCornwall)
[17:55:14] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1058.eqiad.wmnet with reason: host reimage
[17:55:50] <wikibugs>	 (PS2) Ssingh: lvs5004: commission new LVS host (eqsin hardware refresh) [puppet] - https://gerrit.wikimedia.org/r/862943 (https://phabricator.wikimedia.org/T322048)
[17:56:14] <wikibugs>	 (CR) Ssingh: "rebased, no code change" [puppet] - https://gerrit.wikimedia.org/r/862943 (https://phabricator.wikimedia.org/T322048) (owner: Ssingh)
[17:56:34] <wikibugs>	 (CR) Bking: snapshot: Parallelize cirrus dumps by db shard (1 comment) [puppet] - https://gerrit.wikimedia.org/r/856654 (https://phabricator.wikimedia.org/T265056) (owner: Ebernhardson)
[17:56:47] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P42194 and previous config saved to /var/cache/conftool/dbconfig/20221201-175647-ladsgroup.json
[17:57:09] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1059.eqiad.wmnet with reason: host reimage
[17:57:42] <icinga-wm>	 PROBLEM - Recursive DNS on 103.102.166.8 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS
[17:57:52] <wikibugs>	 (CR) Ebernhardson: "o" [puppet] - https://gerrit.wikimedia.org/r/856654 (https://phabricator.wikimedia.org/T265056) (owner: Ebernhardson)
[17:58:05] <sukhe>	 recursive DNS is me, will be fixed shortly
[17:58:17] <sukhe>	 (reimaging in progress)
[17:58:42] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1058.eqiad.wmnet with reason: host reimage
[17:59:34] <wikibugs>	 (CR) Ssingh: [C: +2] lvs5004: commission new LVS host (eqsin hardware refresh) [puppet] - https://gerrit.wikimedia.org/r/862943 (https://phabricator.wikimedia.org/T322048) (owner: Ssingh)
[18:00:04] <jouncebot>	 bd808: How many deployers does it take to do Technical Engagement weekly deploy (Toolhub, Developer portal, Striker)? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221201T1800).
[18:01:29] <bd808>	 I will be deploying a new developer-portal build today. We've got some new translations and also have updated some of the static site generation libraries.
[18:01:33] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1059.eqiad.wmnet with reason: host reimage
[18:01:35] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host lvs5004.eqsin.wmnet with OS buster
[18:01:45] <wikibugs>	 SRE, ops-eqsin, DC-Ops, Traffic, Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host lvs5004.eqsin.wmnet with OS buster
[18:01:51] <wikibugs>	 (PS1) Hnowlan: api-gateway: define restbase match correctly [deployment-charts] - https://gerrit.wikimedia.org/r/863013 (https://phabricator.wikimedia.org/T324222)
[18:02:10] <wikibugs>	 (CR) Ahmon Dancy: [C: +1] scap.cfg: enable K8s deployments in prod cluster [puppet] - https://gerrit.wikimedia.org/r/862892 (owner: Jaime Nuche)
[18:02:39] <wikibugs>	 (CR) Bking: [C: +2] snapshot: Apply minor cleanups to cirrus dump script [puppet] - https://gerrit.wikimedia.org/r/856653 (owner: Ebernhardson)
[18:04:06] <wikibugs>	 (CR) Bking: snapshot: Parallelize cirrus dumps by db shard (1 comment) [puppet] - https://gerrit.wikimedia.org/r/856654 (https://phabricator.wikimedia.org/T265056) (owner: Ebernhardson)
[18:04:31] <wikibugs>	 (CR) Bking: [C: +2] snapshot: Parallelize cirrus dumps by db shard [puppet] - https://gerrit.wikimedia.org/r/856654 (https://phabricator.wikimedia.org/T265056) (owner: Ebernhardson)
[18:04:47] <wikibugs>	 (PS1) BryanDavis: developer-portal: Bump container version to 2022-12-01-121802-production [deployment-charts] - https://gerrit.wikimedia.org/r/863014
[18:04:52] <wikibugs>	 (PS7) Bking: snapshot: Parallelize cirrus dumps by db shard [puppet] - https://gerrit.wikimedia.org/r/856654 (https://phabricator.wikimedia.org/T265056) (owner: Ebernhardson)
[18:10:09] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1058.eqiad.wmnet with OS bullseye
[18:10:16] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cloudvirt1058.eqiad.wmnet with OS bullseye comple...
[18:10:37] <wikibugs>	 (CR) BryanDavis: [C: +2] developer-portal: Bump container version to 2022-12-01-121802-production [deployment-charts] - https://gerrit.wikimedia.org/r/863014 (owner: BryanDavis)
[18:11:31] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1060']
[18:11:39] <icinga-wm>	 PROBLEM - Recursive DNS on 2001:df2:e500:1:103:102:166:8 is CRITICAL: DNS_QUERY CRITICALError response or zero answers: https://wikitech.wikimedia.org/wiki/DNS
[18:11:53] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudvirt1060']
[18:11:54] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T323907)', diff saved to https://phabricator.wikimedia.org/P42195 and previous config saved to /var/cache/conftool/dbconfig/20221201-181153-ladsgroup.json
[18:11:55] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1200.eqiad.wmnet with reason: Maintenance
[18:11:57] <stashbot>	 T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907
[18:12:09] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1200.eqiad.wmnet with reason: Maintenance
[18:12:15] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1200 (T323907)', diff saved to https://phabricator.wikimedia.org/P42196 and previous config saved to /var/cache/conftool/dbconfig/20221201-181215-ladsgroup.json
[18:12:47] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirt1060.eqiad.wmnet with OS bullseye
[18:12:53] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloudvirt1060.eqiad.wmnet with OS bullseye
[18:14:14] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1061']
[18:14:45] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job haproxy in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:15:38] <wikibugs>	 (Merged) jenkins-bot: developer-portal: Bump container version to 2022-12-01-121802-production [deployment-charts] - https://gerrit.wikimedia.org/r/863014 (owner: BryanDavis)
[18:16:15] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1059.eqiad.wmnet with OS bullseye
[18:16:21] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cloudvirt1059.eqiad.wmnet with OS bullseye comple...
[18:16:45] <wikibugs>	 (CR) BPirkle: [C: +1] api-gateway: define restbase match correctly [deployment-charts] - https://gerrit.wikimedia.org/r/863013 (https://phabricator.wikimedia.org/T324222) (owner: Hnowlan)
[18:16:55] <logmsgbot>	 !log bd808@deploy1002 helmfile [staging] START helmfile.d/services/developer-portal: apply
[18:17:22] <wikibugs>	 (CR) Hnowlan: [C: +2] api-gateway: define restbase match correctly [deployment-charts] - https://gerrit.wikimedia.org/r/863013 (https://phabricator.wikimedia.org/T324222) (owner: Hnowlan)
[18:17:31] <logmsgbot>	 !log bd808@deploy1002 helmfile [staging] DONE helmfile.d/services/developer-portal: apply
[18:17:38] <logmsgbot>	 !log bd808@deploy1002 helmfile [codfw] START helmfile.d/services/developer-portal: apply
[18:19:25] <logmsgbot>	 !log bd808@deploy1002 helmfile [codfw] DONE helmfile.d/services/developer-portal: apply
[18:19:41] <logmsgbot>	 !log bd808@deploy1002 helmfile [eqiad] START helmfile.d/services/developer-portal: apply
[18:19:42] <wikibugs>	 (CR) RLazarus: [C: +1] api-gateway: define restbase match correctly [deployment-charts] - https://gerrit.wikimedia.org/r/863013 (https://phabricator.wikimedia.org/T324222) (owner: Hnowlan)
[18:21:16] <logmsgbot>	 !log bd808@deploy1002 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply
[18:23:42] <wikibugs>	 (Merged) jenkins-bot: api-gateway: define restbase match correctly [deployment-charts] - https://gerrit.wikimedia.org/r/863013 (https://phabricator.wikimedia.org/T324222) (owner: Hnowlan)
[18:25:01] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1006-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[18:25:18] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/api-gateway: sync
[18:25:35] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/api-gateway: sync
[18:26:49] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/api-gateway: sync
[18:27:11] <rzl>	 jouncebot: nowandnext
[18:27:11] <jouncebot>	 For the next 0 hour(s) and 32 minute(s): Technical Engagement weekly deploy (Toolhub, Developer portal, Striker) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221201T1800)
[18:27:11] <jouncebot>	 In 0 hour(s) and 32 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221201T1900)
[18:27:22] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/api-gateway: sync
[18:27:29] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/api-gateway: sync
[18:27:38] <rzl>	 starting to decom mw[1307-1348], they'll be fully depooled when the train deploy starts, so no conflict
[18:27:57] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: sync
[18:29:52] * bd808 is done with the tech engagement deploy window
[18:30:07] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (Papaul)
[18:30:34] <dancy>	 rzl: Would it be possible to deploy https://gerrit.wikimedia.org/r/c/operations/puppet/+/862892 now?
[18:31:02] <rzl>	 dancy: sure but give me a few minutes on this decom first :)
[18:31:16] <dancy>	 ok.. no rush.  Thank you!
[18:31:39] <wikibugs>	 (CR) Herron: "sketching out this approach with a simple panel layout initially, interested in your notes" [grafana-grizzly] - https://gerrit.wikimedia.org/r/861947 (https://phabricator.wikimedia.org/T320749) (owner: Herron)
[18:33:23] <wikibugs>	 Puppet, Infrastructure-Foundations: Fix autorestart and debclient dependency - https://phabricator.wikimedia.org/T324229 (jbond) p:Triage→Medium
[18:34:37] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1057.eqiad.wmnet with OS bullseye
[18:34:42] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cloudvirt1057.eqiad.wmnet with OS bullseye execut...
[18:35:57] <icinga-wm>	 PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) is CRITICAL: Test retrieve featured image data for April 29, 2016 returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds
[18:36:36] <rzl>	 jhathaway, brett: meant to highlight you as oncall, fyi decomming mw[1307-1348]
[18:36:58] <brett>	 thanks!
[18:37:00] <logmsgbot>	 !log rzl@cumin2002 conftool action : set/pooled=no; selector: name=mw13(0[7-9]|[1-3]\d|4[0-8])\..*
[18:37:05] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 197 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[18:38:05] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudvirt1061']
[18:38:19] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirt1057.eqiad.wmnet with OS bullseye
[18:38:26] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloudvirt1057.eqiad.wmnet with OS bullseye
[18:38:27] <icinga-wm>	 RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[18:39:31] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[18:43:39] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 127 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[18:43:55] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1061']
[18:46:35] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 1 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[18:51:24] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1057.eqiad.wmnet with reason: host reimage
[18:51:42] <wikibugs>	 (CR) Ottomata: [C: +1] New configs for android schemas [mediawiki-config] - https://gerrit.wikimedia.org/r/863011 (owner: Sharvaniharan)
[18:53:49] <wikibugs>	 (CR) Ottomata: Add a new production images for spark and spark-operator (1 comment) [docker-images/production-images] - https://gerrit.wikimedia.org/r/838151 (https://phabricator.wikimedia.org/T318730) (owner: Btullis)
[18:54:45] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job haproxy in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:55:10] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1057.eqiad.wmnet with reason: host reimage
[18:57:42] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T323907)', diff saved to https://phabricator.wikimedia.org/P42197 and previous config saved to /var/cache/conftool/dbconfig/20221201-185742-ladsgroup.json
[18:57:46] <stashbot>	 T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907
[19:00:04] <jouncebot>	 dancy and brennen: #bothumor My software never has bugs. It just develops random features. Rise for MediaWiki train - Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221201T1900).
[19:00:43] <dancy>	 Hello!  I'm going to deploy a new release of scap before rolling the train
[19:01:39] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 364 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[19:01:45] <logmsgbot>	 !log dancy@deploy1002 Installing scap version "4.30.0" for 601 hosts
[19:01:46] <wikibugs>	 SRE, Dependency-Tracking, Wikibase-Quality-Constraints, Wikidata, and 2 others: Store WikibaseQualityConstraint check data in persistent storage instead of in the cache - https://phabricator.wikimedia.org/T204024 (Eevans)
[19:02:17] <logmsgbot>	 !log dancy@deploy1002 Installation of scap version "4.30.0" completed for 601 hosts
[19:02:52] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1060.eqiad.wmnet with reason: host reimage
[19:02:59] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 1 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[19:05:55] <wikibugs>	 (PS1) TrainBranchBot: group2 wikis to 1.40.0-wmf.12 [mediawiki-config] - https://gerrit.wikimedia.org/r/863020 (https://phabricator.wikimedia.org/T320517)
[19:05:57] <wikibugs>	 (CR) TrainBranchBot: [C: +2] group2 wikis to 1.40.0-wmf.12 [mediawiki-config] - https://gerrit.wikimedia.org/r/863020 (https://phabricator.wikimedia.org/T320517) (owner: TrainBranchBot)
[19:06:28] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1060.eqiad.wmnet with reason: host reimage
[19:06:39] <wikibugs>	 (Merged) jenkins-bot: group2 wikis to 1.40.0-wmf.12 [mediawiki-config] - https://gerrit.wikimedia.org/r/863020 (https://phabricator.wikimedia.org/T320517) (owner: TrainBranchBot)
[19:06:47] <wikibugs>	 (PS1) Hnowlan: api-gateway: add option to remove part of url path [deployment-charts] - https://gerrit.wikimedia.org/r/863021 (https://phabricator.wikimedia.org/T317326)
[19:06:53] <wikibugs>	 SRE, ops-eqiad, Cassandra, DC-Ops, User-Eevans: Relocate hosts: aqs10[3-5] - https://phabricator.wikimedia.org/T307035 (Eevans)
[19:07:18] <wikibugs>	 (CR) Hnowlan: "Sample output for dev change here: https://phabricator.wikimedia.org/P42198" [deployment-charts] - https://gerrit.wikimedia.org/r/863021 (https://phabricator.wikimedia.org/T317326) (owner: Hnowlan)
[19:07:27] <rzl>	 dancy: sorry, had to step away for a moment -- am I too late for that scap.cfg change to be helpful?
[19:08:01] <dancy>	 yeah, I specified the option manually for this train operation.  You're good to deploy after train finishes (~3 minutes)
[19:08:27] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[19:08:27] <rzl>	 ah, apologies -- let me know when, and I'll go ahead
[19:08:30] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[19:08:31] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[19:08:34] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[19:08:42] <wikibugs>	 (PS2) Hnowlan: api-gateway: add option to remove part of url path [deployment-charts] - https://gerrit.wikimedia.org/r/863021 (https://phabricator.wikimedia.org/T317326)
[19:08:56] <wikibugs>	 SRE, ops-eqiad, Cassandra, DC-Ops, User-Eevans: Relocate hosts: aqs10[3-5] - https://phabricator.wikimedia.org/T307035 (Eevans) >>! In T307035#8078353, @Cmjohnson wrote: > @Eevans take your time, I just want to make sure that we're not falling behind on-site. Let me know whenever you're ready...
[19:09:41] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1057.eqiad.wmnet with OS bullseye
[19:09:48] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cloudvirt1057.eqiad.wmnet with OS bullseye comple...
[19:12:49] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P42199 and previous config saved to /var/cache/conftool/dbconfig/20221201-191248-ladsgroup.json
[19:13:40] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[19:14:02] <rzl>	 won't merge this until the train is finished:
[19:14:09] <wikibugs>	 (PS1) RLazarus: scap: Replace proxies that are being decommed [puppet] - https://gerrit.wikimedia.org/r/863022 (https://phabricator.wikimedia.org/T306162)
[19:15:54] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudvirt1061']
[19:16:37] <logmsgbot>	 !log dancy@deploy1002 rebuilt and synchronized wikiversions files: group2 wikis to 1.40.0-wmf.12  refs T320517
[19:16:40] <stashbot>	 T320517: 1.40.0-wmf.12 deployment blockers - https://phabricator.wikimedia.org/T320517
[19:16:45] <dancy>	 rzl: done!
[19:18:28] <wikibugs>	 (PS1) Ssingh: hiera: add dns5004.yaml [puppet] - https://gerrit.wikimedia.org/r/863024
[19:19:55] <wikibugs>	 (CR) Ssingh: [C: +2] hiera: add dns5004.yaml [puppet] - https://gerrit.wikimedia.org/r/863024 (owner: Ssingh)
[19:20:18] <jinxer-wm>	 (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:20:24] <wikibugs>	 (CR) RLazarus: [C: +2] scap.cfg: enable K8s deployments in prod cluster [puppet] - https://gerrit.wikimedia.org/r/862892 (owner: Jaime Nuche)
[19:20:52] <rzl>	 dancy: merged, want a manual run anywhere?
[19:20:59] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[19:21:00] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[19:21:07] <dancy>	 I'll run sync-world now
[19:21:21] <rzl>	 oh of course :) 👍
[19:21:37] <wikibugs>	 (CR) Ahmon Dancy: "FYI I ran "touch /var/lib/deploy-mwdebug/pause" on deploy1002 and left the file in place." [puppet] - https://gerrit.wikimedia.org/r/862892 (owner: Jaime Nuche)
[19:21:51] <logmsgbot>	 !log dancy@deploy1002 Started scap: testing k8s deployment
[19:22:10] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1060.eqiad.wmnet with OS bullseye
[19:22:56] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cloudvirt1060.eqiad.wmnet with OS bullseye comple...
[19:25:01] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) resolved: Elasticsearch instance cloudelastic1006-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[19:25:48] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1061']
[19:26:22] <wikibugs>	 (CR) RLazarus: [C: +2] scap: Replace proxies that are being decommed [puppet] - https://gerrit.wikimedia.org/r/863022 (https://phabricator.wikimedia.org/T306162) (owner: RLazarus)
[19:27:16] <logmsgbot>	 !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host lvs5004.eqsin.wmnet with OS buster
[19:27:23] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudvirt1061']
[19:27:30] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (Papaul)
[19:27:34] <wikibugs>	 SRE, ops-eqsin, DC-Ops, Traffic, Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host lvs5004.eqsin.wmnet with OS buster executed with errors: - lvs5004 (...
[19:27:39] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[19:27:55] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P42200 and previous config saved to /var/cache/conftool/dbconfig/20221201-192755-ladsgroup.json
[19:28:08] <logmsgbot>	 !log dancy@deploy1002 Finished scap: testing k8s deployment (duration: 06m 17s)
[19:28:59] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1061']
[19:35:18] <jinxer-wm>	 (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:35:19] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudvirt1061']
[19:37:32] <icinga-wm>	 PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[19:38:18] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1061']
[19:38:46] <icinga-wm>	 RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST
[19:39:40] <logmsgbot>	 !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dns5004.wikimedia.org with OS buster
[19:39:50] <wikibugs>	 SRE, ops-eqsin, DC-Ops, Traffic, Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host dns5004.wikimedia.org with OS buster executed with errors: - dns5004...
[19:40:10] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 202 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[19:40:14] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host dns5004.wikimedia.org with OS buster
[19:40:24] <wikibugs>	 SRE, ops-eqsin, DC-Ops, Traffic, Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host dns5004.wikimedia.org with OS buster
[19:41:22] <wikibugs>	 (PS1) Eevans: Promote Cassandra 3.11.13 to '3.x' (aka stable) [puppet] - https://gerrit.wikimedia.org/r/863026
[19:41:30] <mutante>	 !log gitlab2002 (gitlab-replica) - upgrading gitlab-ce
[19:41:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:42:20] <wikibugs>	 (PS2) Eevans: Promote Cassandra 3.11.13 to '3.x' (aka stable) [puppet] - https://gerrit.wikimedia.org/r/863026
[19:42:47] <logmsgbot>	 !log rzl@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 42 hosts with reason: decom
[19:42:56] <wikibugs>	 (CR) Eevans: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/863026 (owner: Eevans)
[19:43:02] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T323907)', diff saved to https://phabricator.wikimedia.org/P42201 and previous config saved to /var/cache/conftool/dbconfig/20221201-194301-ladsgroup.json
[19:43:03] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[19:43:04] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[19:43:05] <stashbot>	 T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907
[19:43:27] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[19:43:47] <logmsgbot>	 !log rzl@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 42 hosts with reason: decom
[19:43:56] <wikibugs>	 SRE, ops-eqiad, DC-Ops, serviceops, Patch-For-Review: Decommission mw13[07-48] - https://phabricator.wikimedia.org/T306162 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=6c20a9fc-5041-4ab7-bed4-f80a2643f954) set by rzl@cumin2002 for 1 day, 0:00:00 on 42 host(s) and their se...
[19:44:34] <logmsgbot>	 !log rzl@cumin2002 conftool action : set/pooled=inactive; selector: name=mw13(0[7-9]|[1-3]\d|4[0-8])\..*
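Note on the selector above: a quick way to sanity-check that the pattern covers exactly mw1307 through mw1348 (the 42 hosts downtimed a moment earlier) is to run it against a list of candidate names. This is only a sketch. The `.eqiad.wmnet` suffix on the candidates and the assumption that the selector must match the whole name are illustrative, not taken from conftool itself.

```python
import re

# Pattern copied from the conftool log line (the part after "name=").
SELECTOR = re.compile(r"mw13(0[7-9]|[1-3]\d|4[0-8])\..*")


def matching_hosts(candidates):
    """Return the candidates whose full name matches the selector.

    Whole-name matching is an assumption made for this sketch.
    """
    return [host for host in candidates if SELECTOR.fullmatch(host)]


# Hypothetical candidate list: every mw13NN name with an assumed domain suffix.
candidates = [f"mw13{n:02d}.eqiad.wmnet" for n in range(100)]
hosts = matching_hosts(candidates)

# Count of matches, then the first and last matching names.
print(len(hosts), hosts[0], hosts[-1])
```

The three alternatives in the group cover 07-09, 10-39 and 40-48 respectively, which is how the pattern lines up with the mw13[07-48] range named in the decommission task.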
[19:44:53] <mutante>	 !log gitlab-runner1002 - upgrading gitlab-runner package
[19:44:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:46:00] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 318 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[19:46:27] <wikibugs>	 (CR) Eevans: [C: -1] "Not yet ready." [puppet] - https://gerrit.wikimedia.org/r/863026 (owner: Eevans)
[19:47:40] <icinga-wm>	 PROBLEM - Host 2001:df2:e500:1:103:102:166:8 is DOWN: PING CRITICAL - Packet loss = 100%
[19:47:40] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[19:49:18] <jinxer-wm>	 (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:49:50] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job haproxy in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[19:50:40] <icinga-wm>	 PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[19:50:50] <icinga-wm>	 RECOVERY - MD RAID on ganeti2013 is OK: OK: Active: 12, Working: 12, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[19:52:26] <icinga-wm>	 RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST
[19:53:42] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudvirt1061']
[19:56:08] <icinga-wm>	 PROBLEM - Host 2001:df2:e500:1:103:102:166:8 is DOWN: CRITICAL - Destination Unreachable (2001:df2:e500:1:103:102:166:8)
[19:56:32] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirt1061.eqiad.wmnet with OS bullseye
[19:56:44] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloudvirt1061.eqiad.wmnet with OS bullseye
[19:59:19] <jinxer-wm>	 (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:59:48] <logmsgbot>	 !log aokoth@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab1004.wikimedia.org with reason: upgrade gitlab1004 to new version https://phabricator.wikimedia.org/T324195
[20:00:01] <logmsgbot>	 !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab1004.wikimedia.org with reason: upgrade gitlab1004 to new version https://phabricator.wikimedia.org/T324195
[20:02:37] <wikibugs>	 (PS3) Ryan Kemper: add grizzly dashboard for WDQS uptime [grafana-grizzly] - https://gerrit.wikimedia.org/r/862178 (https://phabricator.wikimedia.org/T323064)
[20:04:18] <jinxer-wm>	 (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:07:00] <wikibugs>	 (CR) Vgutierrez: [C: +1] varnish: Remove unused dstat plugins [puppet] - https://gerrit.wikimedia.org/r/862371 (owner: BCornwall)
[20:09:05] <wikibugs>	 (CR) Ssingh: [C: +1] "(Thanks to Valentin for fixing the remaining issues with this CR.)" [software/acme-chief] - https://gerrit.wikimedia.org/r/860637 (https://phabricator.wikimedia.org/T321309) (owner: Ssingh)
[20:09:18] <jinxer-wm>	 (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:09:27] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1061.eqiad.wmnet with reason: host reimage
[20:09:55] <wikibugs>	 (PS5) Vgutierrez: setup.py: update dependencies for bullseye [software/acme-chief] - https://gerrit.wikimedia.org/r/860637 (https://phabricator.wikimedia.org/T321309) (owner: Ssingh)
[20:11:19] <jinxer-wm>	 (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:12:23] <jinxer-wm>	 (ThanosQueryHttpRequestQueryRangeErrorRateHigh) firing: Thanos Query is failing to handle requests. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryHttpRequestQueryRangeErrorRateHigh
[20:12:23] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) firing: (2) Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts  - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[20:12:23] <jinxer-wm>	 (ThanosQueryRangeLatencyHigh) firing: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryRangeLatencyHigh
[20:12:48] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on dns5004.wikimedia.org with reason: host reimage
[20:12:59] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1061.eqiad.wmnet with reason: host reimage
[20:14:31] <wikibugs>	 (CR) Vgutierrez: [C: +2] setup.py: update dependencies for bullseye [software/acme-chief] - https://gerrit.wikimedia.org/r/860637 (https://phabricator.wikimedia.org/T321309) (owner: Ssingh)
[20:14:45] <jinxer-wm>	 (JobUnavailable) resolved: (2) Reduced availability for job haproxy in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[20:15:04] <icinga-wm>	 PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/random/title (retrieve a random article title) is CRITICAL: Test retrieve a random article title returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds
[20:16:18] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dns5004.wikimedia.org with reason: host reimage
[20:16:28] <icinga-wm>	 RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[20:17:07] <wikibugs>	 (PS1) Vgutierrez: Release 0.36 [software/acme-chief] - https://gerrit.wikimedia.org/r/863028 (https://phabricator.wikimedia.org/T321309)
[20:17:09] <wikibugs>	 (CR) Paladox: "This change is ready for review." [puppet] - https://gerrit.wikimedia.org/r/862908 (owner: Paladox)
[20:17:11] <wikibugs>	 (PS4) Ryan Kemper: add grizzly dashboard for WDQS uptime [grafana-grizzly] - https://gerrit.wikimedia.org/r/862178 (https://phabricator.wikimedia.org/T323064)
[20:17:13] <wikibugs>	 (CR) Paladox: "This change is ready for review." [puppet] - https://gerrit.wikimedia.org/r/862909 (owner: Paladox)
[20:17:23] <jinxer-wm>	 (ThanosQueryHttpRequestQueryRangeErrorRateHigh) resolved: Thanos Query is failing to handle requests. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryHttpRequestQueryRangeErrorRateHigh
[20:17:23] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) firing: (2) Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts  - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[20:17:45] <wikibugs>	 (PS5) Ryan Kemper: add grizzly dashboard for WDQS uptime [grafana-grizzly] - https://gerrit.wikimedia.org/r/862178 (https://phabricator.wikimedia.org/T323064)
[20:18:14] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:20:04] <icinga-wm>	 PROBLEM - Recursive DNS on 103.102.166.8 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS
[20:21:18] <jinxer-wm>	 (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:22:08] <wikibugs>	 (CR) Paladox: "This change is ready for review." [labs/private] - https://gerrit.wikimedia.org/r/862910 (owner: Paladox)
[20:22:23] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) resolved: (2) Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts  - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[20:22:23] <jinxer-wm>	 (ThanosQueryRangeLatencyHigh) resolved: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryRangeLatencyHigh
[20:23:13] <wikibugs>	 (PS6) Ryan Kemper: add grizzly dashboard for WDQS uptime [grafana-grizzly] - https://gerrit.wikimedia.org/r/862178 (https://phabricator.wikimedia.org/T323064)
[20:23:18] <jinxer-wm>	 (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:25:00] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:26:58] <wikibugs>	 (PS7) Ryan Kemper: add grizzly dashboard for WDQS uptime [grafana-grizzly] - https://gerrit.wikimedia.org/r/862178 (https://phabricator.wikimedia.org/T323064)
[20:27:06] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1061.eqiad.wmnet with OS bullseye
[20:27:12] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cloudvirt1061.eqiad.wmnet with OS bullseye comple...
[20:28:08] <icinga-wm>	 PROBLEM - Recursive DNS on 2001:df2:e500:1:103:102:166:8 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS
[20:28:18] <jinxer-wm>	 (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:32:03] <wikibugs>	 (PS8) Ryan Kemper: add grizzly dashboard for WDQS uptime [grafana-grizzly] - https://gerrit.wikimedia.org/r/862178 (https://phabricator.wikimedia.org/T323064)
[20:33:30] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (Papaul)
[20:34:37] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (Papaul) Open→Resolved @Andrew all yours
[20:35:48] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) firing: Thanos Query Frontend has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/aa7Rx0oMk/thanos-query-frontend - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[20:36:21] <wikibugs>	 (CR) Ottomata: Update the spark and spark-operator images (1 comment) [docker-images/production-images] - https://gerrit.wikimedia.org/r/850244 (https://phabricator.wikimedia.org/T318730) (owner: Btullis)
[20:36:23] <jinxer-wm>	 (ThanosQueryHttpRequestQueryRangeErrorRateHigh) firing: Thanos Query is failing to handle requests. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryHttpRequestQueryRangeErrorRateHigh
[20:37:23] <jinxer-wm>	 (ThanosQueryRangeLatencyHigh) firing: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryRangeLatencyHigh
[20:37:31] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 320 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[20:38:51] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[20:41:23] <jinxer-wm>	 (ThanosQueryHttpRequestQueryRangeErrorRateHigh) resolved: Thanos Query is failing to handle requests. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryHttpRequestQueryRangeErrorRateHigh
[20:41:47] <wikibugs>	 (Abandoned) Sharvaniharan: Stream configs for newly migrated android schemas [mediawiki-config] - https://gerrit.wikimedia.org/r/783874 (https://phabricator.wikimedia.org/T306385) (owner: Sharvaniharan)
[20:41:54] <wikibugs>	 (PS9) Ryan Kemper: add grizzly dashboard for WDQS uptime [grafana-grizzly] - https://gerrit.wikimedia.org/r/862178 (https://phabricator.wikimedia.org/T323064)
[20:44:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job pdnsrec in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[20:46:01] <wikibugs>	 (PS10) Ryan Kemper: add grizzly dashboard for WDQS uptime [grafana-grizzly] - https://gerrit.wikimedia.org/r/862178 (https://phabricator.wikimedia.org/T323064)
[20:47:17] <logmsgbot>	 !log aokoth@cumin1001 START - Cookbook sre.hosts.remove-downtime for gitlab1004.wikimedia.org
[20:47:18] <logmsgbot>	 !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for gitlab1004.wikimedia.org
[20:50:07] <wikibugs>	 (PS3) Gergő Tisza: GrowthExperiments: Run refreshUserImpactData maintenance script in production [mediawiki-config] - https://gerrit.wikimedia.org/r/859568 (https://phabricator.wikimedia.org/T322541) (owner: Kosta Harlan)
[20:50:23] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 102 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[20:51:50] <wikibugs>	 (PS4) Gergő Tisza: GrowthExperiments: Enable user impact refresh script on pilot wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/859568 (https://phabricator.wikimedia.org/T322541) (owner: Kosta Harlan)
[20:52:13] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[20:52:26] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[20:54:23] <jinxer-wm>	 (ThanosQueryHttpRequestQueryRangeErrorRateHigh) firing: Thanos Query is failing to handle requests. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryHttpRequestQueryRangeErrorRateHigh
[20:55:30] <wikibugs>	 (CR) Gergő Tisza: GrowthExperiments: Enable user impact refresh script on pilot wikis (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/859568 (https://phabricator.wikimedia.org/T322541) (owner: Kosta Harlan)
[20:56:07] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 135 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[20:59:23] <jinxer-wm>	 (ThanosQueryHttpRequestQueryRangeErrorRateHigh) resolved: Thanos Query is failing to handle requests. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryHttpRequestQueryRangeErrorRateHigh
[21:00:05] <jouncebot>	 brennen: (Dis)respected human, time to deploy UTC late backport and config training (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221201T2100). Please do the needful.
[21:00:05] <jouncebot>	 zabe, MatmaRex, and tgr: A patch you scheduled for UTC late backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[21:00:35] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:00:42] <MatmaRex>	 hi
[21:01:19] <MatmaRex>	 i'm requesting a few interesting operations today, so you might want to do everyone else first :)
[21:01:33] <brennen>	 o/
[21:02:23] <jinxer-wm>	 (ThanosQueryRangeLatencyHigh) resolved: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryRangeLatencyHigh
[21:02:44] <zabe>	 hey
[21:02:48] <wikibugs>	 (PS11) Ryan Kemper: wdqs: add grizzly dashboard for uptime [grafana-grizzly] - https://gerrit.wikimedia.org/r/862178 (https://phabricator.wikimedia.org/T323064)
[21:02:58] <rzl>	 brennen: fyi I'm still decomming mw1307-mw1348, but they're pooled=inactive so should be totally out of your hair
[21:03:15] <rzl>	 happy to wait if you'd prefer to avoid the noise though :)
[21:03:20] <tgr>	 o/
[21:04:00] <wikibugs>	 (CR) Ryan Kemper: "Okay, this is ready for official review. With the latest iteration the request_sl*_query formulas are working properly (albeit perf-wise s" [grafana-grizzly] - https://gerrit.wikimedia.org/r/862178 (https://phabricator.wikimedia.org/T323064) (owner: Ryan Kemper)
[21:04:14] <brennen>	 rzl: no worries
[21:04:21] <rzl>	 👍
[21:04:25] <brennen>	 zabe, starting with yours since sharvani_ doesn't seem to be here
[21:04:33] <zabe>	 ok
[21:05:12] <wikibugs>	 (PS2) Brennen Bearnes: Start writing to cul_actor on test wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/861853 (https://phabricator.wikimedia.org/T233004) (owner: Zabe)
[21:05:26] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by brennen@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/861853 (https://phabricator.wikimedia.org/T233004) (owner: Zabe)
[21:06:09] <wikibugs>	 (Merged) jenkins-bot: Start writing to cul_actor on test wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/861853 (https://phabricator.wikimedia.org/T233004) (owner: Zabe)
[21:06:23] <logmsgbot>	 !log brennen@deploy1002 Started scap: Backport for [[gerrit:861853|Start writing to cul_actor on test wikis (T233004)]]
[21:06:27] <stashbot>	 T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004
[21:07:09] <icinga-wm>	 PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) is CRITICAL: Test retrieve the most-read articles for January 1, 2016 (with aggregated=true) returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds
[21:08:08] <logmsgbot>	 !log brennen@deploy1002 brennen and zabe: Backport for [[gerrit:861853|Start writing to cul_actor on test wikis (T233004)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet
[21:08:45] <sharvani_>	 Hi Brennen .. here for the config patch deployment! 👋
[21:08:57] <brennen>	 hi sharvani_!  welcome.  just doing a patch for zabe then we'll get yours underway.
[21:09:04] <brennen>	 zabe: anything to test?
[21:09:15] <zabe>	 yes
[21:09:22] <sharvani_>	 Ty 🙂
[21:09:27] <icinga-wm>	 RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[21:09:37] <zabe>	 brennen, could you do a query for me?
[21:10:17] <brennen>	 zabe: yeah, will figure it out - pestering thcipriani. :)
[21:10:24] <zabe>	 basically 'select * from cu_log limit 1' on testwiki
[21:10:33] <logmsgbot>	 !log rzl@cumin1001 START - Cookbook sre.hosts.decommission for hosts mw[1307-1326].eqiad.wmnet
[21:10:41] <zabe>	 feel free to post the result to a wmf-nda protected paste if you prefer
[21:10:48] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) resolved: Thanos Query Frontend has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/aa7Rx0oMk/thanos-query-frontend - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[21:11:05] <zabe>	 I would like to see if the field is correctly being written to
[21:11:33] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:12:57] <brennen>	 zabe: https://phabricator.wikimedia.org/P42202
[21:13:22] <wikibugs>	 (03PS1) 10RLazarus: cumin: Replace the mw-jobrunner-canary which is about to be decommed [puppet] - 10https://gerrit.wikimedia.org/r/863032 (https://phabricator.wikimedia.org/T303162)
[21:13:28] <logmsgbot>	 !log rzl@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=99) for hosts mw[1307-1326].eqiad.wmnet
[21:13:43] <zabe>	 bah sorry, forgot the order by, could you 'select * from cu_log order by cul_id desc limit 1;'?
[21:14:29] <brennen>	 zabe: paste updated, looks more like what you're expecting i'd think
[21:15:01] <zabe>	 brennen, yep thanks and the actor id is correct, so lgtm
[21:15:11] <brennen>	 cool, syncing
[21:21:09] <wikibugs>	 (03CR) 10RLazarus: [C: 03+2] cumin: Replace the mw-jobrunner-canary which is about to be decommed [puppet] - 10https://gerrit.wikimedia.org/r/863032 (https://phabricator.wikimedia.org/T303162) (owner: 10RLazarus)
[21:21:20] <logmsgbot>	 !log brennen@deploy1002 Finished scap: Backport for [[gerrit:861853|Start writing to cul_actor on test wikis (T233004)]] (duration: 14m 56s)
[21:21:24] <stashbot>	 T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004
[21:21:38] <wikibugs>	 10SRE, 10SRE-Access-Requests: Request for access to analytics-platform-eng-admins for mlitn - https://phabricator.wikimedia.org/T324101 (10matthiasmullie) Thanks, all!
[21:21:41] <brennen>	 sharvani_: you're up next
[21:21:59] <sharvani_>	 Ready! thank you!
[21:22:28] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by brennen@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/863011 (owner: 10Sharvaniharan)
[21:22:38] <wikibugs>	 (03PS2) 10Brennen Bearnes: New configs for android schemas [mediawiki-config] - 10https://gerrit.wikimedia.org/r/863011 (owner: 10Sharvaniharan)
[21:22:50] <wikibugs>	 (03CR) 10TrainBranchBot: "Approved by brennen@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/863011 (owner: 10Sharvaniharan)
[21:22:50] <logmsgbot>	 !log rzl@cumin1001 START - Cookbook sre.hosts.decommission for hosts mw[1307-1326].eqiad.wmnet
[21:22:59] <wikibugs>	 (03PS7) 10Ottomata: flink and flink-kubernetes-operator image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/858356 (https://phabricator.wikimedia.org/T316519)
[21:23:13] <zabe>	 thanks for your help :)
[21:23:37] <wikibugs>	 (03PS8) 10Ottomata: flink and flink-kubernetes-operator image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/858356 (https://phabricator.wikimedia.org/T316519)
[21:23:56] <wikibugs>	 (CR) Ottomata: flink and flink-kubernetes-operator image (2 comments) [docker-images/production-images] - https://gerrit.wikimedia.org/r/858356 (https://phabricator.wikimedia.org/T316519) (owner: Ottomata)
[21:24:48] <wikibugs>	 (03Merged) 10jenkins-bot: New configs for android schemas [mediawiki-config] - 10https://gerrit.wikimedia.org/r/863011 (owner: 10Sharvaniharan)
[21:25:03] <logmsgbot>	 !log brennen@deploy1002 Started scap: Backport for [[gerrit:863011|New configs for android schemas]]
[21:25:45] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[21:25:58] <andrewbogott>	 !log saving an image of wikitech-static-ord (aka wikitech-static) before upgrading the host to Buster
[21:25:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:26:09] <Reedy>	 party time
[21:26:22] <Reedy>	 andrewbogott: I'm around for MW stuff if it's unhappy after the OS upgrade :)
[21:26:33] <andrewbogott>	 thank you!
[21:26:48] <logmsgbot>	 !log brennen@deploy1002 brennen and sharvaniharan: Backport for [[gerrit:863011|New configs for android schemas]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet
[21:26:58] <brennen>	 sharvani_: on mwdebug servers - please test
[21:27:10] <sharvani_>	 testing now..
[21:28:26] <sharvani_>	 seeing it! thank you! 
[21:28:30] <brennen>	 cool, syncing
[21:28:41] <sharvani_>	 🙌
[21:29:54] <andrewbogott>	 boy it is taking a surprisingly long time to make this backup
[21:30:00] <wikibugs>	 (03CR) 10Kosta Harlan: [C: 03+1] GrowthExperiments: Enable user impact refresh script on pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/859568 (https://phabricator.wikimedia.org/T322541) (owner: 10Kosta Harlan)
[21:31:17] <jinxer-wm>	 (AppserversUnreachable) firing: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=eqiad%20prometheus/ops&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[21:34:46] <icinga-wm>	 PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/random/title (retrieve a random article title) is CRITICAL: Test retrieve a random article title returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds
[21:34:52] <logmsgbot>	 !log brennen@deploy1002 Finished scap: Backport for [[gerrit:863011|New configs for android schemas]] (duration: 09m 49s)
[21:35:56] <icinga-wm>	 RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[21:36:05] <wikibugs>	 (03PS5) 10Brennen Bearnes: GrowthExperiments: Enable user impact refresh script on pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/859568 (https://phabricator.wikimedia.org/T322541) (owner: 10Kosta Harlan)
[21:36:23] <sharvani_>	 Thanks for deploying Brennen!
[21:36:24] <brennen>	 tgr: yrs next
[21:36:27] <brennen>	 sure thing sharvani_ 
[21:36:42] <tgr>	 brennen: thanks! no need to test
[21:36:46] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by brennen@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/859568 (https://phabricator.wikimedia.org/T322541) (owner: 10Kosta Harlan)
[21:37:00] <brennen>	 tgr: cool
[21:38:16] <wikibugs>	 (03Merged) 10jenkins-bot: GrowthExperiments: Enable user impact refresh script on pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/859568 (https://phabricator.wikimedia.org/T322541) (owner: 10Kosta Harlan)
[21:38:29] <logmsgbot>	 !log brennen@deploy1002 Started scap: Backport for [[gerrit:859568|GrowthExperiments: Enable user impact refresh script on pilot wikis (T322541)]]
[21:38:32] <stashbot>	 T322541: UserImpact: Set up maintenance script to run in betalabs and production - https://phabricator.wikimedia.org/T322541
[21:40:13] <logmsgbot>	 !log brennen@deploy1002 brennen and kharlan: Backport for [[gerrit:859568|GrowthExperiments: Enable user impact refresh script on pilot wikis (T322541)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet
[21:42:10] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 102 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[21:43:06] <jinxer-wm>	 (ConfdResourceFailed) firing: (68) confd resource _var_lib_gdnsd_discovery-apertium.state.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[21:43:34] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 2 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[21:43:48] <icinga-wm>	 PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) is CRITICAL: Test retrieve the most read articles for January 1, 2016 returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds
[21:45:08] <wikibugs>	 (03CR) 10RLazarus: [C: 03+2] "Too late to fix it now, but for the record the commit message should have referenced T306162, not 303162." [puppet] - 10https://gerrit.wikimedia.org/r/863032 (https://phabricator.wikimedia.org/T303162) (owner: 10RLazarus)
[21:46:17] <logmsgbot>	 !log brennen@deploy1002 Finished scap: Backport for [[gerrit:859568|GrowthExperiments: Enable user impact refresh script on pilot wikis (T322541)]] (duration: 07m 48s)
[21:46:21] <stashbot>	 T322541: UserImpact: Set up maintenance script to run in betalabs and production - https://phabricator.wikimedia.org/T322541
[21:46:48] <icinga-wm>	 RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[21:47:48] <brennen>	 MatmaRex: couple questions - do you have the necessary access for your stuff in beta?  and how long do you estimate for that maintenance script in prod?  i've never run this.
[21:48:25] <MatmaRex>	 hi
[21:48:42] <MatmaRex>	 brennen: i don't have access to beta, as far as i know
[21:48:56] <brennen>	 do you want access to beta?
[21:49:21] <MatmaRex>	 eeeeh not really
[21:49:30] <Reedy>	 lol
[21:49:34] <brennen>	 c'mon
[21:49:38] <MatmaRex>	 heh
[21:49:42] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 101 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[21:49:48] <MatmaRex>	 i don't know how long the script will take, but on the order of days
[21:50:05] <MatmaRex>	 i guess we're almost out of time in the window today, i can reschedule this. it's not urgent
[21:50:14] <brennen>	 sounds good. :)
[21:50:15] <wikibugs>	 (03PS6) 10Andrea Denisse: admin: add dasm to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/860132 (https://phabricator.wikimedia.org/T322591)
[21:51:13] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] admin: add dasm to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/860132 (https://phabricator.wikimedia.org/T322591) (owner: 10Andrea Denisse)
[21:51:18] <jinxer-wm>	 (AppserversUnreachable) resolved: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=eqiad%20prometheus/ops&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[21:51:18] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[21:51:50] <MatmaRex>	 brennen: so how would i actually get access to the beta cluster?
[21:52:27] <brennen>	 MatmaRex: can add you as an admin in horizon and you'd have root shell access.  seems like you know what you're about, so seems reasonable for you to do it. :)
[21:53:44] <MatmaRex>	 my only worry is that i'll get pinged one day when beta goes down for no reason ;)
[21:54:56] <brennen>	 there are hundreds of folks with this access
[21:55:09] <brennen>	 (most of them studiously ignoring what happens in beta, if i'm any example)
[22:01:27] <brennen>	 !log end of utc late backport & config window
[22:01:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:02:08] <cwhite>	 !log restart swift-proxy on thanos::frontend eqiad
[22:02:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:02:13] <wikibugs>	 (03PS3) 10Eevans: Promote Cassandra 3.11.13 to '3.x' (aka stable) [puppet] - 10https://gerrit.wikimedia.org/r/863026
[22:03:20] <wikibugs>	 (03PS7) 10Andrea Denisse: admin: add dasm to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/860132 (https://phabricator.wikimedia.org/T322591)
[22:04:05] <wikibugs>	 (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/863026 (owner: 10Eevans)
[22:04:06] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 206 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[22:04:15] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] admin: add dasm to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/860132 (https://phabricator.wikimedia.org/T322591) (owner: 10Andrea Denisse)
[22:06:14] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 203 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[22:07:20] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 1 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[22:07:28] <logmsgbot>	 !log rzl@cumin1001 START - Cookbook sre.dns.netbox
[22:10:24] <wikibugs>	 (03PS4) 10Eevans: Promote Cassandra 3.11.13 to '3.x' (aka stable) [puppet] - 10https://gerrit.wikimedia.org/r/863026
[22:14:26] <wikibugs>	 (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/863026 (owner: 10Eevans)
[22:15:07] <wikibugs>	 (03CR) 10David Caro: "The only blocker here is creating the service on port 8000, and would be nice to address the other unresolved comments." [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/857588 (https://phabricator.wikimedia.org/T293645) (owner: 10Raymond Ndibe)
[22:16:34] <icinga-wm>	 PROBLEM - Disk space on aphlict1001 is CRITICAL: DISK CRITICAL - free space: / 667 MB (3% inode=91%): /tmp 667 MB (3% inode=91%): /var/tmp 667 MB (3% inode=91%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=aphlict1001&var-datasource=eqiad+prometheus/ops
[22:16:52] <icinga-wm>	 PROBLEM - gdnsd daemon runs exactly once on dns5004 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 497 (gdnsd), args /usr/sbin/gdnsd https://wikitech.wikimedia.org/wiki/DNS
[22:17:04] <icinga-wm>	 PROBLEM - Check if anycast-healthchecker and all configured threads are running on dns5004 is CRITICAL: CRITICAL: anycast-healthchecker could be down as pid file /var/run/anycast-healthchecker/anycast-healthchecker.pid doesnt exist https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running
[22:17:26] <icinga-wm>	 PROBLEM - AuthDNS-over-TLS Works on dns5004 is CRITICAL: CRITICAL: ns[012] kdig DoTLS check failure https://wikitech.wikimedia.org/wiki/DNS
[22:17:34] <icinga-wm>	 PROBLEM - Auth DNS on dns5004 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS
[22:18:09] <wikibugs>	 (03PS8) 10Andrea Denisse: admin: add dasm to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/860132 (https://phabricator.wikimedia.org/T322591)
[22:18:44] <sukhe>	 ^ dns5004 issues are expected, bblack and I are debugging
[22:18:47] <sukhe>	 please safely ignore for now
[22:18:59] <wikibugs>	 (03CR) 10Eevans: [C: 03+1] "PCC (no-op): https://puppet-compiler.wmflabs.org/output/863026/1486/" [puppet] - 10https://gerrit.wikimedia.org/r/863026 (owner: 10Eevans)
[22:19:12] <logmsgbot>	 !log rzl@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mw[1307-1326].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - rzl@cumin1001"
[22:20:28] <rzl>	 sukhe: was about to ask, running the netbox cookbook (via the decom cookbook) also gave me some dns5004 errors but I assume those are similarly expected
[22:20:43] <rzl>	 I guess the real question is, am I okay to ignore them and continue, or should I not have started decomming in the first place :P
[22:21:16] <sukhe>	 rzl: out of curiosity, which host were you running the decom cookbook on?
[22:21:19] <sukhe>	 was it in eqsin?
[22:21:27] <rzl>	 no, mw1307-1326
[22:21:29] <sukhe>	 ok
[22:21:32] <sukhe>	 thank you
[22:21:35] <rzl>	 (with more to follow but all eqiad appservers)
[22:21:49] <sukhe>	 you can share the dns5004 errors with me
[22:21:57] <sukhe>	 and I will let you know if it's fine to continue or not
[22:21:58] <icinga-wm>	 PROBLEM - Check systemd state on dns5004 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_gdnsd_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:23:46] <rzl>	 sukhe: sure, partial output was this https://www.irccloud.com/pastebin/YcgWE4EA/
[22:23:53] <rzl>	 lmk if that's not enough context
[22:24:12] <sukhe>	 oh yeah, you can ignore this
[22:24:22] <rzl>	 rad, thanks
[22:25:01] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1006-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[22:26:04] <icinga-wm>	 PROBLEM - gdnsd checkconf on dns5004 is CRITICAL: CRITICAL: gdnsd -S checkconf failure https://wikitech.wikimedia.org/wiki/gdnsd
[22:26:25] <rzl>	 papaul: my decom cookbook wants to sync some of your netbox-hiera changes for cloudvirt hosts at the same time -- are they okay to merge, or should I hold off?
[22:27:36] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:28:32] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:30:08] <icinga-wm>	 PROBLEM - Wikitech and wt-static content in sync on cloudweb1003 is CRITICAL: wikitech-static CRIT - failed to fetch timestamp from wikitech-static https://wikitech.wikimedia.org/wiki/Wikitech-static
[22:30:48] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49122 bytes in 0.146 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:32:10] <andrewbogott>	 Reedy: I've upgraded things but it insists on running php7.3 even though 
[22:32:12] <andrewbogott>	 https://www.irccloud.com/pastebin/nbplO6wh/
[22:32:23] <andrewbogott>	 any guess which package I'm missing?
[22:33:21] <Reedy>	 andrewbogott: is PHP7.3 still installed, and $webserver config is still pointing at 7.3?
[22:34:10] <Reedy>	 Yeah, libapache2-mod-php7.3 is still installed
[22:34:16] <Reedy>	 (along with other 7.3 stuffs)
[22:34:26] <andrewbogott>	 yeah, if I remove it it breaks in other interesting ways.
[22:34:44] <wikibugs>	 (03PS1) 10Ssingh: hiera: temporarily remove references to dns5004 [puppet] - 10https://gerrit.wikimedia.org/r/863046 (https://phabricator.wikimedia.org/T322048)
[22:34:50] <Reedy>	 a2dismod php7.3 && a2enmod php7.4 && service apache2 restart
[22:34:51] <Reedy>	 or something?
[22:35:34] <Reedy>	 hmm, mods-enabled looks like it's 7.4
[22:35:42] <icinga-wm>	 RECOVERY - gdnsd daemon runs exactly once on dns5004 is OK: PROCS OK: 1 process with UID = 497 (gdnsd), args /usr/sbin/gdnsd https://wikitech.wikimedia.org/wiki/DNS
[22:35:44] <Reedy>	 just restart apache?
[22:35:57] <andrewbogott>	 a2enmod was the piece I was missing apparently
[22:36:12] <andrewbogott>	 I guess installing libapache2-mod-php7.4 doesn't do that?
[22:36:51] <wikibugs>	 (03PS2) 10Ssingh: hiera: temporarily remove references to dns5004 [puppet] - 10https://gerrit.wikimedia.org/r/863046 (https://phabricator.wikimedia.org/T322048)
[22:36:56] <andrewbogott>	 Anyway it seems better now!  thank you
[22:36:56] <Reedy>	 I think it will if there's no other php module enabled... but if there is, probably decides not to clobber the existing
[22:37:17] <andrewbogott>	 that makes sense... sort of
[22:37:46] <wikibugs>	 (03CR) 10BBlack: [C: 03+1] hiera: temporarily remove references to dns5004 [puppet] - 10https://gerrit.wikimedia.org/r/863046 (https://phabricator.wikimedia.org/T322048) (owner: 10Ssingh)
[22:37:52] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] hiera: temporarily remove references to dns5004 [puppet] - 10https://gerrit.wikimedia.org/r/863046 (https://phabricator.wikimedia.org/T322048) (owner: 10Ssingh)
[22:38:06] <jinxer-wm>	 (ConfdResourceFailed) resolved: (68) confd resource _var_lib_gdnsd_discovery-apertium.state.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[22:38:48] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8646 bytes in 0.237 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:42:26] <andrewbogott>	 !log upgraded wikitech-static-ord (aka wikitech-static) to Debian Buster, installed php7.4, upgraded MW to 1_39. Will delete the rackspace backup image in a few days.
[22:42:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:48:28] <urbanecm>	 jouncebot: nowandnext
[22:48:28] <jouncebot>	 No deployments scheduled for the next 9 hour(s) and 11 minute(s)
[22:48:28] <jouncebot>	 In 9 hour(s) and 11 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221202T0800)
[22:48:36] <wikibugs>	 (03PS6) 10Urbanecm: GrowthExperiments: Remove unused config variable GEMentorDashboardUseVue [mediawiki-config] - 10https://gerrit.wikimedia.org/r/856008 (owner: 10Sergio Gimeno)
[22:48:42] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] "cleanup" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/856008 (owner: 10Sergio Gimeno)
[22:49:06] <logmsgbot>	 !log urbanecm@deploy1002 backport aborted:  (duration: 00m 03s)
[22:49:10] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/856008 (owner: 10Sergio Gimeno)
[22:50:11] <wikibugs>	 (03Merged) 10jenkins-bot: GrowthExperiments: Remove unused config variable GEMentorDashboardUseVue [mediawiki-config] - 10https://gerrit.wikimedia.org/r/856008 (owner: 10Sergio Gimeno)
[22:50:28] <logmsgbot>	 !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:856008|GrowthExperiments: Remove unused config variable GEMentorDashboardUseVue]]
[22:50:29] <wikibugs>	 (03PS5) 10Urbanecm: GrowthExperiments: Remove unused GEAllowAccessToNewImpactModule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/859545 (https://phabricator.wikimedia.org/T323526) (owner: 10Kosta Harlan)
[22:50:39] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] "cleanup, no-op for prod" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/859545 (https://phabricator.wikimedia.org/T323526) (owner: 10Kosta Harlan)
[22:53:08] <icinga-wm>	 RECOVERY - Check systemd state on dns5004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:54:38] <logmsgbot>	 !log rzl@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mw[1307-1326].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - rzl@cumin1001"
[22:54:38] <logmsgbot>	 !log rzl@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[22:54:38] <logmsgbot>	 !log rzl@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts mw[1307-1326].eqiad.wmnet
[22:54:50] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops, 10Patch-For-Review: Decommission mw13[07-48] - https://phabricator.wikimedia.org/T306162 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by rzl@cumin1001 for hosts: `mw[1307-1326].eqiad.wmnet` - mw1307.eqiad.wmnet (**WARN**)   - Downtimed host...
[22:56:41] <wikibugs>	 (CR) BCornwall: [V: +1 C: +2] varnish: Remove unused dstat plugins (2 comments) [puppet] - https://gerrit.wikimedia.org/r/862371 (owner: BCornwall)
[22:56:58] <rzl>	 !log rzl@puppetmaster1001:~$ sudo puppet node deactivate mw1312.eqiad.wmnet  # T306162
[22:57:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:57:01] <stashbot>	 T306162: Decommission mw13[07-48] - https://phabricator.wikimedia.org/T306162
[22:57:04] <rzl>	 !log rzl@puppetmaster1001:~$ sudo puppet node deactivate mw1320.eqiad.wmnet  # T306162
[22:57:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:57:50] <icinga-wm>	 RECOVERY - gdnsd checkconf on dns5004 is OK: OK: gdnsd -S checkconf success https://wikitech.wikimedia.org/wiki/gdnsd
[22:57:56] <logmsgbot>	 !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:856008|GrowthExperiments: Remove unused config variable GEMentorDashboardUseVue]] (duration: 07m 28s)
[22:59:39] <wikibugs>	 (03PS1) 10BCornwall: Remove since-deleted dstat plugin dir [puppet] - 10https://gerrit.wikimedia.org/r/863050
[22:59:43] <logmsgbot>	 !log rzl@cumin1001 START - Cookbook sre.hosts.decommission for hosts mw[1327-1346].eqiad.wmnet
[23:01:44] <wikibugs>	 (03CR) 10BCornwall: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38553/console" [puppet] - 10https://gerrit.wikimedia.org/r/863050 (owner: 10BCornwall)
[23:03:08] <wikibugs>	 Puppet, SRE-tools, Infrastructure-Foundations, Python3-Porting, User-jbond: Port dstat related scripts to Python 3 - https://phabricator.wikimedia.org/T277910 (BCornwall) Open→Invalid No longer necessary since the scripts have been removed (See https://gerrit.wikimedia.org/r/c/operati...
[23:03:13] <wikibugs>	 10Puppet, 10SRE, 10SRE-tools, 10Infrastructure-Foundations, and 4 others: Forward port Python2 files to Python3 in Puppet Repository - https://phabricator.wikimedia.org/T247364 (10BCornwall)
[23:18:17] <jinxer-wm>	 (AppserversUnreachable) firing: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=eqiad%20prometheus/ops&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[23:23:17] <jinxer-wm>	 (AppserversUnreachable) resolved: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=eqiad%20prometheus/ops&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[23:31:40] <logmsgbot>	 !log rzl@cumin1001 START - Cookbook sre.dns.netbox
[23:33:18] <jinxer-wm>	 (AppserversUnreachable) firing: Appserver unavailable for cluster api_appserver at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=eqiad%20prometheus/ops&var-cluster=api_appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[23:34:15] <icinga-wm>	 RECOVERY - Recursive DNS on 103.102.166.8 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS
[23:34:25] <logmsgbot>	 !log rzl@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mw[1327-1346].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - rzl@cumin1001"
[23:34:27] <icinga-wm>	 RECOVERY - Auth DNS on dns5004 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS
[23:34:37] <icinga-wm>	 RECOVERY - Recursive DNS on 2001:df2:e500:1:103:102:166:8 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS
[23:35:46] <logmsgbot>	 !log rzl@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mw[1327-1346].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - rzl@cumin1001"
[23:35:46] <logmsgbot>	 !log rzl@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[23:35:47] <logmsgbot>	 !log rzl@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw[1327-1346].eqiad.wmnet
[23:35:58] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops, 10Patch-For-Review: Decommission mw13[07-48] - https://phabricator.wikimedia.org/T306162 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by rzl@cumin1001 for hosts: `mw[1327-1346].eqiad.wmnet` - mw1327.eqiad.wmnet (**WARN**)   - Downtimed host...
[23:37:29] <logmsgbot>	 !log rzl@cumin1001 START - Cookbook sre.hosts.decommission for hosts mw[1347-1348].eqiad.wmnet
[23:39:45] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job pdnsrec in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[23:43:45] <logmsgbot>	 !log rzl@cumin1001 START - Cookbook sre.dns.netbox
[23:44:09] <icinga-wm>	 RECOVERY - AuthDNS-over-TLS Works on dns5004 is OK: OK: ns[012] kdig DoTLS check success https://wikitech.wikimedia.org/wiki/DNS
[23:45:58] <logmsgbot>	 !log rzl@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mw[1347-1348].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - rzl@cumin1001"
[23:47:13] <logmsgbot>	 !log rzl@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mw[1347-1348].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - rzl@cumin1001"
[23:47:13] <logmsgbot>	 !log rzl@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[23:47:14] <logmsgbot>	 !log rzl@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw[1347-1348].eqiad.wmnet
[23:48:57] <icinga-wm>	 RECOVERY - Check if anycast-healthchecker and all configured threads are running on dns5004 is OK: OK: UP (pid=26982) and all threads (2) are running https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running
[23:53:18] <jinxer-wm>	 (AppserversUnreachable) resolved: Appserver unavailable for cluster api_appserver at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=eqiad%20prometheus/ops&var-cluster=api_appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[23:54:16] <wikibugs>	 (03PS4) 10RLazarus: conftool: remove old mw servers [puppet] - 10https://gerrit.wikimedia.org/r/859966 (https://phabricator.wikimedia.org/T306162) (owner: 10Giuseppe Lavagetto)
[23:56:05] <wikibugs>	 (03CR) 10RLazarus: [C: 03+2] conftool: remove old mw servers [puppet] - 10https://gerrit.wikimedia.org/r/859966 (https://phabricator.wikimedia.org/T306162) (owner: 10Giuseppe Lavagetto)
[23:56:18] <wikibugs>	 (03PS4) 10RLazarus: site: remove old appservers [puppet] - 10https://gerrit.wikimedia.org/r/859967 (https://phabricator.wikimedia.org/T306162) (owner: 10Giuseppe Lavagetto)
[23:58:15] <wikibugs>	 (03CR) 10RLazarus: [C: 03+2] site: remove old appservers [puppet] - 10https://gerrit.wikimedia.org/r/859967 (https://phabricator.wikimedia.org/T306162) (owner: 10Giuseppe Lavagetto)