[00:09:19] <wikibugs>	 (03PS1) 10Dzahn: phabricator: use systemd::sysuser to create vcs user [puppet] - 10https://gerrit.wikimedia.org/r/865207
[00:11:24] <wikibugs>	 (03PS1) 10Dzahn: phabricator: rm code from before system user was created with systemd [puppet] - 10https://gerrit.wikimedia.org/r/865208
[00:16:49] <wikibugs>	 (03PS2) 10Dzahn: phabricator: rm code from before system user was created with systemd [puppet] - 10https://gerrit.wikimedia.org/r/865208
[00:18:51] <logmsgbot>	 !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes1023.eqiad.wmnet with OS bullseye
[00:18:56] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes102[34] - https://phabricator.wikimedia.org/T313873 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host kubernetes1023.eqiad.wmnet with OS bullseye executed with errors: - kubernetes1...
[00:21:42] <logmsgbot>	 !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes1024.eqiad.wmnet with OS bullseye
[00:21:47] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes102[34] - https://phabricator.wikimedia.org/T313873 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host kubernetes1024.eqiad.wmnet with OS bullseye executed with errors: - kubernetes1...
[00:45:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:47:01] <wikibugs>	 (03CR) 10Jberkel: Make "make" available in all images (031 comment) [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/864828 (https://phabricator.wikimedia.org/T320343) (owner: 10Jberkel)
[00:50:03] <jinxer-wm>	 (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:00:35] <wikibugs>	 (03PS1) 10Arlolra: Disable wgParserEnableLegacyMediaDOM on group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865214 (https://phabricator.wikimedia.org/T314318)
[01:31:49] <icinga-wm>	 RECOVERY - Check systemd state on gitlab1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:37:34] <wikibugs>	 (03CR) 10Cwhite: [C: 03+1] Enable profile::auto_restarts::service for Burrow [puppet] - 10https://gerrit.wikimedia.org/r/865114 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff)
[01:39:34] <wikibugs>	 (03CR) 10Cwhite: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/865106 (https://phabricator.wikimedia.org/T301762) (owner: 10Filippo Giunchedi)
[01:41:01] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance elastic1083-production-search-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[01:41:45] <jinxer-wm>	 (JobUnavailable) firing: (9) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:50:51] <icinga-wm>	 PROBLEM - puppet last run on gitlab1003 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[01:51:45] <jinxer-wm>	 (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:56:45] <jinxer-wm>	 (JobUnavailable) firing: (11) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:06:45] <jinxer-wm>	 (JobUnavailable) firing: (11) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:11:45] <jinxer-wm>	 (JobUnavailable) firing: (11) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:20:05] <icinga-wm>	 RECOVERY - puppet last run on gitlab1003 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[02:21:45] <jinxer-wm>	 (JobUnavailable) firing: (11) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:32:33] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[02:33:45] <logmsgbot>	 !log cwhite@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['logstash1011']
[02:35:29] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on logstash1011 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7f1c7c36c2e8: Failed to establish a new connection: [Errno 111] Connection refused)) https://wikitech.wi
[02:35:29] <icinga-wm>	 org/wiki/Search%23Administration
[02:36:01] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) resolved: Elasticsearch instance elastic1083-production-search-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[02:39:36] <logmsgbot>	 !log cwhite@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['logstash1011']
[02:49:19] <logmsgbot>	 !log cwhite@cumin2002 START - Cookbook sre.hosts.reimage for host logstash1011.eqiad.wmnet with OS bullseye
[02:57:33] <icinga-wm>	 PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad URL) is CRITICAL: Test bad URL returned the unexpected status 503 (expecting: 404) https://wikitech.wikimedia.org/wiki/Citoid
[03:12:54] <jinxer-wm>	 (NodeTextfileStale) firing: (48) Stale textfile for cp1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[03:15:26] <logmsgbot>	 !log cwhite@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on logstash1011.eqiad.wmnet with reason: host reimage
[03:18:34] <logmsgbot>	 !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on logstash1011.eqiad.wmnet with reason: host reimage
[03:27:57] <icinga-wm>	 RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[03:59:32] <logmsgbot>	 !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host logstash1011.eqiad.wmnet with OS bullseye
[04:16:01] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance elastic1083-production-search-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[04:26:33] <icinga-wm>	 RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:42:29] <wikibugs>	 10SRE-swift-storage, 10Community-Tech, 10MediaWiki-extensions-Phonos, 10MW-1.40-notes (1.40.0-wmf.6; 2022-10-17): Establish Phonos production storage requirements - https://phabricator.wikimedia.org/T320675 (10dmaza) >>! In T320675#8368902, @Eevans wrote: > TL;DR I think it's OK if we fly by the seat of ou...
[04:43:57] <wikibugs>	 (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/865198
[05:29:03] <jinxer-wm>	 (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:34:03] <jinxer-wm>	 (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:43:37] <icinga-wm>	 PROBLEM - Host an-worker1108 is DOWN: PING CRITICAL - Packet loss = 100%
[05:46:01] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) resolved: Elasticsearch instance elastic1083-production-search-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[05:47:11] <wikibugs>	 (03PS1) 10Marostegui: db1206: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/865240
[05:48:01] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1206 (re)pooling @ 1%: Testing new RAID controller', diff saved to https://phabricator.wikimedia.org/P42433 and previous config saved to /var/cache/conftool/dbconfig/20221207-054759-root.json
[05:48:08] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] db1206: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/865240 (owner: 10Marostegui)
[05:49:16] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA, 10Phabricator, and 2 others: decommission phab1001.eqiad.wmnet - https://phabricator.wikimedia.org/T323418 (10Marostegui) That's ok Daniel, I will take care of it on this task.
[05:49:39] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA, 10Phabricator, and 2 others: decommission phab1001.eqiad.wmnet - https://phabricator.wikimedia.org/T323418 (10Marostegui) I will merge that change and then proceed and remove grants live
[05:49:54] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] mariadb: remove phab1001 from production-m3 grants [puppet] - 10https://gerrit.wikimedia.org/r/858419 (https://phabricator.wikimedia.org/T323418) (owner: 10Dzahn)
[05:52:12] <wikibugs>	 (03PS1) 10Marostegui: mariadb: remove phab1001 from production-m3 grants [puppet] - 10https://gerrit.wikimedia.org/r/865241 (https://phabricator.wikimedia.org/T323418)
[05:52:27] <wikibugs>	 (03CR) 10Marostegui: "Daniel this required manual rebasing, so it was faster just to send a new patch:  https://gerrit.wikimedia.org/r/865241" [puppet] - 10https://gerrit.wikimedia.org/r/858419 (https://phabricator.wikimedia.org/T323418) (owner: 10Dzahn)
[05:55:49] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] mariadb: remove phab1001 from production-m3 grants [puppet] - 10https://gerrit.wikimedia.org/r/865241 (https://phabricator.wikimedia.org/T323418) (owner: 10Marostegui)
[05:57:30] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA, 10Phabricator, and 3 others: decommission phab1001.eqiad.wmnet - https://phabricator.wikimedia.org/T323418 (10Marostegui) ` root@db1159.eqiad.wmnet[(none)]> select user,host from mysql.user where host like '10.64.16.8'; +----------------+------------+ | User           | Host...
[05:58:04] <marostegui>	 !log Drop phab1001 grants from m3 databases T323418 
[05:58:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:58:07] <stashbot>	 T323418: decommission phab1001.eqiad.wmnet - https://phabricator.wikimedia.org/T323418
[06:00:24] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA, 10Phabricator, and 3 others: decommission phab1001.eqiad.wmnet - https://phabricator.wikimedia.org/T323418 (10Marostegui) All done from the DBA side.
[06:03:06] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1206 (re)pooling @ 5%: Testing new RAID controller', diff saved to https://phabricator.wikimedia.org/P42434 and previous config saved to /var/cache/conftool/dbconfig/20221207-060305-root.json
[06:12:16] <wikibugs>	 10SRE, 10Data-Persistence, 10MediaWiki-extensions-SecurePoll, 10MW-1.40-notes (1.40.0-wmf.12; 2022-11-28), and 2 others: vote.wikimedia.org's Special:Securepoll/list/1402 takes considerably longer in codfw than in eqiad, leading to timeouts - https://phabricator.wikimedia.org/T324556 (10Marostegui)
[06:18:11] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1206 (re)pooling @ 10%: Testing new RAID controller', diff saved to https://phabricator.wikimedia.org/P42435 and previous config saved to /var/cache/conftool/dbconfig/20221207-061810-root.json
[06:22:00] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job trafficserver-text in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:23:27] <wikibugs>	 10SRE, 10ops-eqiad, 10Data-Persistence (work done), 10Phabricator, and 3 others: decommission phab1001.eqiad.wmnet - https://phabricator.wikimedia.org/T323418 (10Marostegui)
[06:32:33] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[06:33:16] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1206 (re)pooling @ 25%: Testing new RAID controller', diff saved to https://phabricator.wikimedia.org/P42436 and previous config saved to /var/cache/conftool/dbconfig/20221207-063316-root.json
[06:43:05] <wikibugs>	 (03PS1) 10Marostegui: site.pp: Clarify db1206 isn't production ready [puppet] - 10https://gerrit.wikimedia.org/r/865387
[06:43:57] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 44, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:44:05] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] site.pp: Clarify db1206 isn't production ready [puppet] - 10https://gerrit.wikimedia.org/r/865387 (owner: 10Marostegui)
[06:44:33] <icinga-wm>	 PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:48:21] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1206 (re)pooling @ 50%: Testing new RAID controller', diff saved to https://phabricator.wikimedia.org/P42437 and previous config saved to /var/cache/conftool/dbconfig/20221207-064821-root.json
[06:55:45] <icinga-wm>	 PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 142, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:59:27] <wikibugs>	 10SRE, 10Data-Persistence, 10MediaWiki-extensions-SecurePoll, 10MW-1.40-notes (1.40.0-wmf.12; 2022-11-28), and 2 others: vote.wikimedia.org's Special:Securepoll/list/1402 takes considerably longer in codfw than in eqiad, leading to timeouts - https://phabricator.wikimedia.org/T324556 (10Urbanecm) Thanks fo...
[07:03:27] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1206 (re)pooling @ 75%: Testing new RAID controller', diff saved to https://phabricator.wikimedia.org/P42438 and previous config saved to /var/cache/conftool/dbconfig/20221207-070326-root.json
[07:03:49] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10Security-Team, 10SecTeam-Processed: Add Kelton Hurd to wmf ldap group - https://phabricator.wikimedia.org/T323941 (10Marostegui) @sbassett I am not sure KHurd is the right user name, from what I can see there are two users with KHurd, there is KHurd and KHurd1, both created...
[07:12:54] <jinxer-wm>	 (NodeTextfileStale) firing: (48) Stale textfile for cp1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[07:18:32] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1206 (re)pooling @ 100%: Testing new RAID controller', diff saved to https://phabricator.wikimedia.org/P42439 and previous config saved to /var/cache/conftool/dbconfig/20221207-071831-root.json
[07:34:28] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10Security-Team, 10SecTeam-Processed: Add Kelton Hurd to wmf ldap group - https://phabricator.wikimedia.org/T323941 (10KHurd-WMF) Hey Marostegui,  I’d be somewhat happy to explain.   The first one in November was created but I had issues logging in, as at times it would show...
[07:36:24] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10Security-Team, 10SecTeam-Processed: Add Kelton Hurd to wmf ldap group - https://phabricator.wikimedia.org/T323941 (10Marostegui) @KHurd-WMF Thanks for the explanation. It is probably easier if you keep `KHurd1` then as it is associated to your wmf email account already. Cou...
[07:37:27] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10Security-Team, 10SecTeam-Processed: Add Kelton Hurd to wmf ldap group - https://phabricator.wikimedia.org/T323941 (10KHurd-WMF)
[07:38:24] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10Security-Team, 10SecTeam-Processed: Add Kelton Hurd to wmf ldap group - https://phabricator.wikimedia.org/T323941 (10KHurd-WMF) Done.  Thanks teammate,  Kelton Hurd Wikimedia Foundation - Security team khurd@wikimedia.org  {F35844006}
[07:42:13] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10Security-Team, 10SecTeam-Processed: Add Kelton Hurd to wmf ldap group - https://phabricator.wikimedia.org/T323941 (10Marostegui) So, `check_user` looks good and KHurd1 is associated to `khurd` WMF email account now.
[07:48:44] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10Security-Team, 10SecTeam-Processed: Add Kelton Hurd to wmf ldap group - https://phabricator.wikimedia.org/T323941 (10KHurd-WMF) Thank you. I appreciate your work on this.
[07:49:33] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10Security-Team, 10SecTeam-Processed: Add Kelton Hurd to wmf ldap group - https://phabricator.wikimedia.org/T323941 (10KHurd-WMF) 05Stalled→03Resolved a:03KHurd-WMF
[07:51:24] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10Security-Team, 10SecTeam-Processed: Add Kelton Hurd to wmf ldap group - https://phabricator.wikimedia.org/T323941 (10Marostegui) a:05KHurd-WMF→03None
[07:51:28] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10Security-Team, 10SecTeam-Processed: Add Kelton Hurd to wmf ldap group - https://phabricator.wikimedia.org/T323941 (10Marostegui) 05Resolved→03Open @KHurd-WMF this is not yet done - I was just verifying it is now fine and also added you to the Phabricator group wmf-nda...
[08:00:05] <jouncebot>	 Amir1 and Urbanecm: It is that lovely time of the day again! You are hereby commanded to deploy UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221207T0800).
[08:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[08:00:29] <urbanecm>	 indeed
[08:03:24] <wikibugs>	 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T321572 (10Jclark-ctr) @ayounsi Are you available to look at this today?
[08:19:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:19:32] <wikibugs>	 10SRE, 10Cloud-Services, 10observability, 10Sustainability (Incident Followup), and 2 others: Switch rsyslog from gtls to ossl - https://phabricator.wikimedia.org/T324623 (10MoritzMuehlenhoff)
[08:23:31] <wikibugs>	 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T321572 (10ayounsi) I noticed that this interface is on FPC4 and {T304712} is about moving links away from FPC4, so better to move it while replacing the optic (and cleaning the patch).  We can for example move it to xe-3/2/2. Ping me a bi...
[08:24:03] <jinxer-wm>	 (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:35:41] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/865177 (https://phabricator.wikimedia.org/T324057) (owner: 10JHathaway)
[08:38:24] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+1] C:vopsbot: Notify service on config change [puppet] - 10https://gerrit.wikimedia.org/r/860625 (owner: 10Clément Goubert)
[08:40:06] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: cr2-esams:FPC0 Parity error - https://phabricator.wikimedia.org/T318783 (10ayounsi) JTAC case 2022-1207-600204 opened asking for an RMA as it's the 2nd time the issue happens.
[08:40:42] <wikibugs>	 10SRE, 10Data-Engineering, 10Shared-Data-Infrastructure: geoip_update_main failure on puppetmaster1001 - https://phabricator.wikimedia.org/T324548 (10Clement_Goubert) At your service o>
[08:44:02] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+1] "The image seems ok to me - just remember to add the user mapping in puppet too before building/publishing the image." [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/857672 (https://phabricator.wikimedia.org/T320552) (owner: 10Clément Goubert)
[08:44:35] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+1] "sigh, puppet." [puppet] - 10https://gerrit.wikimedia.org/r/864662 (https://phabricator.wikimedia.org/T324437) (owner: 10Clément Goubert)
[08:48:46] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] Enable profile::auto_restarts::service for Burrow [puppet] - 10https://gerrit.wikimedia.org/r/865114 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff)
[08:49:32] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Enable profile::auto_restarts::service for Burrow [puppet] - 10https://gerrit.wikimedia.org/r/865114 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff)
[08:50:20] <wikibugs>	 (03CR) 10Clément Goubert: [V: 03+1 C: 03+2] C:vopsbot: Notify service on config change [puppet] - 10https://gerrit.wikimedia.org/r/860625 (owner: 10Clément Goubert)
[08:51:03] <jinxer-wm>	 (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:51:44] <wikibugs>	 (03CR) 10Volans: [C: 04-1] "Thanks for the effort! It's a good start. I did a first pass and left a few comments. Feel free to ping me if you have any questions." [cookbooks] - 10https://gerrit.wikimedia.org/r/865057 (https://phabricator.wikimedia.org/T306661) (owner: 10Slyngshede)
[08:53:31] <wikibugs>	 (03CR) 10Clément Goubert: [V: 03+1 C: 03+2] C:systemd::syslog: Do not filebucket logfiles [puppet] - 10https://gerrit.wikimedia.org/r/864662 (https://phabricator.wikimedia.org/T324437) (owner: 10Clément Goubert)
[08:56:03] <jinxer-wm>	 (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:56:14] <wikibugs>	 10SRE, 10Cloud-Services, 10observability, 10Sustainability (Incident Followup), and 2 others: Switch rsyslog from gtls to ossl - https://phabricator.wikimedia.org/T324623 (10fgiunchedi) I'm generally in favor of switching to openssl for rsyslog (and thank you for the deep dive investigation!), since in produ...
[08:59:47] <wikibugs>	 (03PS2) 10JMeybohm: KubernetesAPILatency: Remove special handling of LIST secret requests [alerts] - 10https://gerrit.wikimedia.org/r/864760 (https://phabricator.wikimedia.org/T323706)
[09:00:59] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] helm-state-metrics: Update resources for v0.2.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/864759 (https://phabricator.wikimedia.org/T323706) (owner: 10JMeybohm)
[09:01:36] <wikibugs>	 (03PS1) 10Jgiannelos: beta-cluster: Fix restbase mathoid URI [puppet] - 10https://gerrit.wikimedia.org/r/865578 (https://phabricator.wikimedia.org/T208758)
[09:02:18] <wikibugs>	 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (10ayounsi) FYI, there are outstanding Homer diffs for asw1-eqsin: `lang=diff [edit interfaces] -   ge-0/0/16 { -       description DISABLED; -       disable; -   } [edit interfaces xe-0...
[09:02:25] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] beta-cluster: Fix restbase mathoid URI [puppet] - 10https://gerrit.wikimedia.org/r/865578 (https://phabricator.wikimedia.org/T208758) (owner: 10Jgiannelos)
[09:03:07] <wikibugs>	 (03PS2) 10Jgiannelos: beta-cluster: Fix restbase mathoid URI [puppet] - 10https://gerrit.wikimedia.org/r/865578 (https://phabricator.wikimedia.org/T208758)
[09:05:40] <wikibugs>	 (03Merged) 10jenkins-bot: helm-state-metrics: Update resources for v0.2.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/864759 (https://phabricator.wikimedia.org/T323706) (owner: 10JMeybohm)
[09:10:23] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 42
[09:10:35] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1108.eqiad.wmnet
[09:11:08] <logmsgbot>	 !log btullis@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host an-worker1108.eqiad.wmnet
[09:12:30] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 42
[09:13:28] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 397715
[09:14:17] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 397715
[09:14:51] <wikibugs>	 (03CR) 10Physikerwelt: [C: 03+1] "If you have shell access you can test it with a simple curl, before merging." [puppet] - 10https://gerrit.wikimedia.org/r/865578 (https://phabricator.wikimedia.org/T208758) (owner: 10Jgiannelos)
[09:14:59] <wikibugs>	 (03PS3) 10Filippo Giunchedi: base: remove support for plaintext remote syslog [puppet] - 10https://gerrit.wikimedia.org/r/865106 (https://phabricator.wikimedia.org/T301762)
[09:17:08] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 395570
[09:17:16] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 395570
[09:18:59] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] KubernetesAPILatency: Remove special handling of LIST secret requests [alerts] - 10https://gerrit.wikimedia.org/r/864760 (https://phabricator.wikimedia.org/T323706) (owner: 10JMeybohm)
[09:19:07] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38610/console" [puppet] - 10https://gerrit.wikimedia.org/r/865106 (https://phabricator.wikimedia.org/T301762) (owner: 10Filippo Giunchedi)
[09:19:43] <icinga-wm>	 RECOVERY - Host an-worker1108 is UP: PING OK - Packet loss = 0%, RTA = 0.21 ms
[09:19:52] <wikibugs>	 10SRE, 10Data-Persistence, 10MediaWiki-extensions-SecurePoll, 10MW-1.40-notes (1.40.0-wmf.12; 2022-11-28), and 2 others: vote.wikimedia.org's Special:Securepoll/list/1402 takes considerably longer in codfw than in eqiad, leading to timeouts - https://phabricator.wikimedia.org/T324556 (10Reedy) It's probabl...
[09:20:36] <wikibugs>	 (03Merged) 10jenkins-bot: KubernetesAPILatency: Remove special handling of LIST secret requests [alerts] - 10https://gerrit.wikimedia.org/r/864760 (https://phabricator.wikimedia.org/T323706) (owner: 10JMeybohm)
[09:22:37] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] base: remove support for plaintext remote syslog [puppet] - 10https://gerrit.wikimedia.org/r/865106 (https://phabricator.wikimedia.org/T301762) (owner: 10Filippo Giunchedi)
[09:23:43] <logmsgbot>	 !log jiji@deploy1002 backport aborted:  (duration: 00m 18s)
[09:23:53] <icinga-wm>	 RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 167, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:24:04] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jiji@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865117 (https://phabricator.wikimedia.org/T267581) (owner: 10Effie Mouzeli)
[09:24:48] <wikibugs>	 (03Merged) 10jenkins-bot: ProductionServices: Use redis_misc servers for LockManager (1/6) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865117 (https://phabricator.wikimedia.org/T267581) (owner: 10Effie Mouzeli)
[09:25:17] <logmsgbot>	 !log jiji@deploy1002 Started scap: Backport for [[gerrit:865117|ProductionServices: Use redis_misc servers for LockManager (1/6) (T267581)]]
[09:25:20] <stashbot>	 T267581: Phase out "redis_sessions" cluster and away from memcached cluster - https://phabricator.wikimedia.org/T267581
[09:26:10] <wikibugs>	 (03CR) 10Jgiannelos: beta-cluster: Fix restbase mathoid URI (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/865578 (https://phabricator.wikimedia.org/T208758) (owner: 10Jgiannelos)
[09:27:18] <wikibugs>	 (03PS3) 10Jgiannelos: beta-cluster: Fix restbase mathoid URI [puppet] - 10https://gerrit.wikimedia.org/r/865578 (https://phabricator.wikimedia.org/T208758)
[09:27:19] <logmsgbot>	 !log jiji@deploy1002 jiji and jiji: Backport for [[gerrit:865117|ProductionServices: Use redis_misc servers for LockManager (1/6) (T267581)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet
[09:31:54] <wikibugs>	 (03CR) 10Physikerwelt: beta-cluster: Fix restbase mathoid URI (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/865578 (https://phabricator.wikimedia.org/T208758) (owner: 10Jgiannelos)
[09:33:31] <wikibugs>	 (03PS3) 10Volans: cluster::cloud_management: create new role [puppet] - 10https://gerrit.wikimedia.org/r/865049 (https://phabricator.wikimedia.org/T319401)
[09:33:33] <wikibugs>	 (03PS2) 10Volans: cloudcumin: setup the 2 new VMs [puppet] - 10https://gerrit.wikimedia.org/r/865116 (https://phabricator.wikimedia.org/T319401)
[09:34:25] <logmsgbot>	 !log jiji@deploy1002 Finished scap: Backport for [[gerrit:865117|ProductionServices: Use redis_misc servers for LockManager (1/6) (T267581)]] (duration: 09m 08s)
[09:34:28] <stashbot>	 T267581: Phase out "redis_sessions" cluster and away from memcached cluster - https://phabricator.wikimedia.org/T267581
[09:34:47] <icinga-wm>	 PROBLEM - Host contint1001 is DOWN: PING CRITICAL - Packet loss = 100%
[09:35:29] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[09:35:50] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[09:35:57] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[09:36:14] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[09:36:21] <logmsgbot>	 !log jayme@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'.
[09:36:37] <logmsgbot>	 !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[09:36:44] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'.
[09:37:02] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[09:39:21] <wikibugs>	 (03CR) 10Clément Goubert: [V: 03+1] "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/865580 (https://phabricator.wikimedia.org/T324649) (owner: 10Clément Goubert)
[09:40:27] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 7568
[09:41:10] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 7568
[09:41:29] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 45430
[09:41:49] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jiji@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865118 (https://phabricator.wikimedia.org/T267581) (owner: 10Effie Mouzeli)
[09:41:59] <wikibugs>	 (03PS5) 10Effie Mouzeli: ProductionServices: Use redis_misc servers for LockManager (2/6) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865118 (https://phabricator.wikimedia.org/T267581)
[09:42:40] <wikibugs>	 (03PS4) 10Effie Mouzeli: ProductionServices: Use redis_misc servers for LockManager (3/6) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865119 (https://phabricator.wikimedia.org/T267581)
[09:42:44] <wikibugs>	 (03PS4) 10Effie Mouzeli: ProductionServices: Use redis_misc servers for LockManager (4/6) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865121 (https://phabricator.wikimedia.org/T267581)
[09:42:46] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 45430
[09:42:51] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 31800
[09:44:04] <logmsgbot>	 !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.network.peering (exit_code=99) with action 'email' for AS: 31800
[09:44:34] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 31800
[09:45:03] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 31800
[09:46:03] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 32098
[09:46:23] <wikibugs>	 (03PS5) 10Effie Mouzeli: ProductionServices: Use redis_misc servers for LockManager (4/6) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865121 (https://phabricator.wikimedia.org/T267581)
[09:47:23] <wikibugs>	 (03CR) 10TrainBranchBot: "Approved by jiji@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865118 (https://phabricator.wikimedia.org/T267581) (owner: 10Effie Mouzeli)
[09:49:45] <icinga-wm>	 RECOVERY - BGP status on cr4-ulsfo is OK: BGP OK - up: 100, down: 2, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:50:23] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 32098
[09:51:01] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 16276
[09:51:08] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, thank you for tackling this!" [puppet] - 10https://gerrit.wikimedia.org/r/865580 (https://phabricator.wikimedia.org/T324649) (owner: 10Clément Goubert)
[09:52:14] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 16276
[09:52:23] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 138064
[09:53:21] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 138064
[09:54:17] <wikibugs>	 (03PS1) 10KarlBeecken: mobileapps: bump image to latest [deployment-charts] - 10https://gerrit.wikimedia.org/r/865583
[09:55:44] <wikibugs>	 (03PS4) 10Volans: cluster::cloud_management: create new role [puppet] - 10https://gerrit.wikimedia.org/r/865049 (https://phabricator.wikimedia.org/T319401)
[09:59:04] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 13150
[09:59:37] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 13150
[09:59:43] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 8932
[10:00:01] <logmsgbot>	 !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.network.peering (exit_code=99) with action 'email' for AS: 8932
[10:00:10] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 35320
[10:00:53] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 35320
[10:02:33] <icinga-wm>	 RECOVERY - Host contint1001 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms
[10:04:21] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] cloudcumin: setup the 2 new VMs [puppet] - 10https://gerrit.wikimedia.org/r/865116 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans)
[10:04:41] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+2] ProductionServices: Use redis_misc servers for LockManager (2/6) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865118 (https://phabricator.wikimedia.org/T267581) (owner: 10Effie Mouzeli)
[10:04:56] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] "Thanks." [puppet] - 10https://gerrit.wikimedia.org/r/865043 (https://phabricator.wikimedia.org/T323783) (owner: 10Stevemunene)
[10:05:10] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 16276
[10:05:13] <wikibugs>	 (03CR) 10Volans: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/865049 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans)
[10:05:24] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] ProductionServices: Use redis_misc servers for LockManager (2/6) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865118 (https://phabricator.wikimedia.org/T267581) (owner: 10Effie Mouzeli)
[10:06:18] <wikibugs>	 (03PS2) 10Stevemunene: Add an-presto1008-1015 to presto cluster [puppet] - 10https://gerrit.wikimedia.org/r/865043 (https://phabricator.wikimedia.org/T323783)
[10:07:03] <logmsgbot>	 !log jiji@deploy1002 Started scap: Backport for [[gerrit:865118|ProductionServices: Use redis_misc servers for LockManager (2/6) (T267581)]]
[10:07:06] <stashbot>	 T267581: Phase out "redis_sessions" cluster and away from memcached cluster - https://phabricator.wikimedia.org/T267581
[10:07:21] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
[10:09:04] <logmsgbot>	 !log jiji@deploy1002 jiji and jiji: Backport for [[gerrit:865118|ProductionServices: Use redis_misc servers for LockManager (2/6) (T267581)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet
[10:09:37] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
[10:11:42] <wikibugs>	 10SRE-tools, 10Spicerack: Spicerack: don't IRC log start/stop of cookbook - https://phabricator.wikimedia.org/T324655 (10ayounsi)
[10:12:18] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: Spicerack: don't IRC log start/stop of cookbook - https://phabricator.wikimedia.org/T324655 (10ayounsi)
[10:12:40] <logmsgbot>	 !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.network.peering (exit_code=99) with action 'email' for AS: 16276
[10:12:53] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 714
[10:13:18] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, one question inline" [puppet] - 10https://gerrit.wikimedia.org/r/865049 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans)
[10:14:18] <wikibugs>	 (03CR) 10KarlBeecken: [C: 03+1] mobileapps: bump image to latest [deployment-charts] - 10https://gerrit.wikimedia.org/r/865583 (owner: 10KarlBeecken)
[10:15:26] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] beta-cluster: Fix restbase mathoid URI [puppet] - 10https://gerrit.wikimedia.org/r/865578 (https://phabricator.wikimedia.org/T208758) (owner: 10Jgiannelos)
[10:15:57] <wikibugs>	 (03CR) 10Stevemunene: [C: 03+2] Add an-presto1008-1015 to presto cluster [puppet] - 10https://gerrit.wikimedia.org/r/865043 (https://phabricator.wikimedia.org/T323783) (owner: 10Stevemunene)
[10:16:10] <wikibugs>	 (03CR) 10Volans: "reply inline" [puppet] - 10https://gerrit.wikimedia.org/r/865049 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans)
[10:17:04] <logmsgbot>	 !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.network.peering (exit_code=99) with action 'email' for AS: 714
[10:17:07] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 40217
[10:17:31] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 40217
[10:17:52] <logmsgbot>	 !log jiji@deploy1002 Finished scap: Backport for [[gerrit:865118|ProductionServices: Use redis_misc servers for LockManager (2/6) (T267581)]] (duration: 10m 48s)
[10:17:55] <stashbot>	 T267581: Phase out "redis_sessions" cluster and away from memcached cluster - https://phabricator.wikimedia.org/T267581
[10:22:00] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job trafficserver-text in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:23:26] <wikibugs>	 (03PS1) 10Hnowlan: restbase: fix deployment-prep services [puppet] - 10https://gerrit.wikimedia.org/r/865586
[10:23:40] <wikibugs>	 (03PS1) 10Ilias Sarantopoulos: ml-services: decrease container memory limits to march constraints [deployment-charts] - 10https://gerrit.wikimedia.org/r/865587 (https://phabricator.wikimedia.org/T323624)
[10:23:48] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jiji@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865119 (https://phabricator.wikimedia.org/T267581) (owner: 10Effie Mouzeli)
[10:24:05] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] cluster::cloud_management: create new role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/865049 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans)
[10:24:35] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] restbase: fix deployment-prep services [puppet] - 10https://gerrit.wikimedia.org/r/865586 (owner: 10Hnowlan)
[10:24:52] <wikibugs>	 (03Merged) 10jenkins-bot: ProductionServices: Use redis_misc servers for LockManager (3/6) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865119 (https://phabricator.wikimedia.org/T267581) (owner: 10Effie Mouzeli)
[10:25:15] <logmsgbot>	 !log jiji@deploy1002 Started scap: Backport for [[gerrit:865119|ProductionServices: Use redis_misc servers for LockManager (3/6) (T267581)]]
[10:25:19] <stashbot>	 T267581: Phase out "redis_sessions" cluster and away from memcached cluster - https://phabricator.wikimedia.org/T267581
[10:26:01] <claime>	 !log rebooted contint1001.eqiad.wmnet
[10:26:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:27:12] <logmsgbot>	 !log jiji@deploy1002 jiji and jiji: Backport for [[gerrit:865119|ProductionServices: Use redis_misc servers for LockManager (3/6) (T267581)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet
[10:29:34] <wikibugs>	 (03PS2) 10Hnowlan: restbase: fix deployment-prep services [puppet] - 10https://gerrit.wikimedia.org/r/865586
[10:30:15] <icinga-wm>	 RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:32:21] <wikibugs>	 (03PS5) 10Volans: cluster::cloud_management: create new role [puppet] - 10https://gerrit.wikimedia.org/r/865049 (https://phabricator.wikimedia.org/T319401)
[10:32:23] <wikibugs>	 (03PS3) 10Volans: cloudcumin: setup the 2 new VMs [puppet] - 10https://gerrit.wikimedia.org/r/865116 (https://phabricator.wikimedia.org/T319401)
[10:32:33] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[10:32:42] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/865049 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans)
[10:33:21] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/865116 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans)
[10:33:48] <wikibugs>	 (03PS1) 10JMeybohm: kubertenes: Fix naming typo [labs/private] - 10https://gerrit.wikimedia.org/r/865588
[10:33:56] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] cloudcumin: setup the 2 new VMs [puppet] - 10https://gerrit.wikimedia.org/r/865116 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans)
[10:34:22] <wikibugs>	 (03PS2) 10JMeybohm: kubertenes: Fix naming typo [labs/private] - 10https://gerrit.wikimedia.org/r/865588
[10:35:44] <logmsgbot>	 !log jiji@deploy1002 Finished scap: Backport for [[gerrit:865119|ProductionServices: Use redis_misc servers for LockManager (3/6) (T267581)]] (duration: 10m 29s)
[10:35:48] <stashbot>	 T267581: Phase out "redis_sessions" cluster and away from memcached cluster - https://phabricator.wikimedia.org/T267581
[10:36:41] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:37:18] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+2 C: 03+2] kubertenes: Fix naming typo [labs/private] - 10https://gerrit.wikimedia.org/r/865588 (owner: 10JMeybohm)
[10:37:25] <wikibugs>	 (03PS1) 10Ilias Sarantopoulos: ml-services: increase limitrange for containers/pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/865589 (https://phabricator.wikimedia.org/T323624)
[10:38:26] <wikibugs>	 (03PS2) 10Ilias Sarantopoulos: ml-services: increase limitrange for containers/pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/865589 (https://phabricator.wikimedia.org/T323624)
[10:38:37] <wikibugs>	 (03Abandoned) 10Ilias Sarantopoulos: ml-services: decrease container memory limits to march constraints [deployment-charts] - 10https://gerrit.wikimedia.org/r/865587 (https://phabricator.wikimedia.org/T323624) (owner: 10Ilias Sarantopoulos)
[10:39:33] <wikibugs>	 10SRE, 10Data-Persistence, 10MediaWiki-extensions-SecurePoll, 10MW-1.40-notes (1.40.0-wmf.12; 2022-11-28), and 2 others: vote.wikimedia.org's Special:Securepoll/list/1402 takes considerably longer in codfw than in eqiad, leading to timeouts - https://phabricator.wikimedia.org/T324556 (10Ladsgroup) yeah, it...
[10:43:02] <wikibugs>	 (03CR) 10Volans: [C: 03+2] cluster::cloud_management: create new role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/865049 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans)
[10:43:06] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] yarn: Enable profile::auto_restarts::service for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/862886 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff)
[10:43:10] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] hue: Enable profile::auto_restarts::service for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/862885 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff)
[10:43:14] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] Enable profile::auto_restarts::service for Superset [puppet] - 10https://gerrit.wikimedia.org/r/862933 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff)
[10:43:39] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] analytics::refinery: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/858604 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff)
[10:46:09] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10vm-requests: eqiad: (1) VM request for cloudcumin1001 - https://phabricator.wikimedia.org/T323516 (10Volans)
[10:46:16] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10vm-requests: codfw: (1) VM request for cumincloud2001 - https://phabricator.wikimedia.org/T323518 (10Volans)
[10:46:30] <wikibugs>	 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops-collab: contint2002 service implementation tracking - https://phabricator.wikimedia.org/T324659 (10hashar)
[10:48:29] <wikibugs>	 10SRE, 10ops-codfw, 10Continuous-Integration-Infrastructure, 10serviceops-radar, 10Release-Engineering-Team (Radar): contint2001.mgmt disappeared from Icinga - https://phabricator.wikimedia.org/T298861 (10hashar)
[10:49:15] <wikibugs>	 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops-collab: contint2002 service implementation tracking - https://phabricator.wikimedia.org/T324659 (10hashar)
[10:49:17] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-collab: Q1:rack/setup/install contint1002 - https://phabricator.wikimedia.org/T313830 (10hashar)
[10:50:00] <logmsgbot>	 !log volans@cumin1001 START - Cookbook sre.ganeti.makevm for new host cloudcumin1001.eqiad.wmnet
[10:50:01] <logmsgbot>	 !log volans@cumin1001 START - Cookbook sre.dns.netbox
[10:50:35] <icinga-wm>	 RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 143, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:50:59] <wikibugs>	 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops-collab: contint1002 service implementation tracking - https://phabricator.wikimedia.org/T313832 (10hashar)
[10:51:02] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-collab: Q1:rack/setup/install contint1002 - https://phabricator.wikimedia.org/T313830 (10hashar)
[10:51:19] <wikibugs>	 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops-collab: contint1002 service implementation tracking - https://phabricator.wikimedia.org/T313832 (10hashar)
[10:51:23] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-collab: Q1:rack/setup/install contint1002 - https://phabricator.wikimedia.org/T313830 (10hashar)
[10:51:42] <hashar>	 sorry for the spam
[10:51:45] <wikibugs>	 (03PS1) 10JMeybohm: pki: Add intermediates for wikikube and wikikube staging [puppet] - 10https://gerrit.wikimedia.org/r/865591
[10:51:47] <wikibugs>	 (03PS1) 10JMeybohm: k8s: Add support for PKI with k8s >= 1.23 [puppet] - 10https://gerrit.wikimedia.org/r/865592 (https://phabricator.wikimedia.org/T307943)
[10:52:05] <logmsgbot>	 !log volans@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM cloudcumin1001.eqiad.wmnet - volans@cumin1001"
[10:53:06] <logmsgbot>	 !log volans@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM cloudcumin1001.eqiad.wmnet - volans@cumin1001"
[10:53:06] <logmsgbot>	 !log volans@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[10:53:06] <logmsgbot>	 !log volans@cumin1001 START - Cookbook sre.dns.wipe-cache cloudcumin1001.eqiad.wmnet on all recursors
[10:53:09] <logmsgbot>	 !log volans@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcumin1001.eqiad.wmnet on all recursors
[10:53:58] <wikibugs>	 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops-collab: contint2002 service implementation tracking - https://phabricator.wikimedia.org/T324659 (10hashar)
[10:54:02] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-collab: Q1:rack/setup/install contint1002 - https://phabricator.wikimedia.org/T313830 (10hashar)
[10:55:07] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] k8s: Add support for PKI with k8s >= 1.23 [puppet] - 10https://gerrit.wikimedia.org/r/865592 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm)
[10:56:09] <wikibugs>	 (03PS1) 10Marostegui: change_echo_unread_wikis_T255174.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/865593 (https://phabricator.wikimedia.org/T255174)
[10:58:18] <logmsgbot>	 !log volans@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host cloudcumin1001.eqiad.wmnet
[10:58:26] <wikibugs>	 10SRE, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team, 10serviceops-collab: contint1002 service implementation tracking - https://phabricator.wikimedia.org/T313832 (10hashar) contint1001 keeps crashing due to a faulty memory stick. It happened on October 31st ( T294276#8357385 ) and ag...
[10:58:48] <wikibugs>	 (03CR) 10Ladsgroup: [C: 04-1] change_echo_unread_wikis_T255174.py: New schema change (031 comment) [software/schema-changes] - 10https://gerrit.wikimedia.org/r/865593 (https://phabricator.wikimedia.org/T255174) (owner: 10Marostegui)
[10:59:28] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10vm-requests: eqiad: (1) VM request for cloudcumin1001 - https://phabricator.wikimedia.org/T323516 (10Volans) VM successfully created running:  ` sudo cookbook sre.ganeti.makevm --cluster eqiad --group D cloudcumin1001 `
[10:59:42] <wikibugs>	 (03PS2) 10Marostegui: change_echo_unread_wikis_T255174.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/865593 (https://phabricator.wikimedia.org/T255174)
[10:59:46] <wikibugs>	 (03CR) 10Marostegui: change_echo_unread_wikis_T255174.py: New schema change (031 comment) [software/schema-changes] - 10https://gerrit.wikimedia.org/r/865593 (https://phabricator.wikimedia.org/T255174) (owner: 10Marostegui)
[11:01:23] <logmsgbot>	 !log volans@cumin2002 START - Cookbook sre.ganeti.makevm for new host cloudcumin2001.codfw.wmnet
[11:01:24] <logmsgbot>	 !log volans@cumin2002 START - Cookbook sre.dns.netbox
[11:02:41] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+1] change_echo_unread_wikis_T255174.py: New schema change (031 comment) [software/schema-changes] - 10https://gerrit.wikimedia.org/r/865593 (https://phabricator.wikimedia.org/T255174) (owner: 10Marostegui)
[11:02:49] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] change_echo_unread_wikis_T255174.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/865593 (https://phabricator.wikimedia.org/T255174) (owner: 10Marostegui)
[11:03:15] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/thumbor: apply
[11:03:41] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply
[11:05:09] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: Spicerack: don't IRC log start/stop of cookbook - https://phabricator.wikimedia.org/T324655 (10Volans) For the latter part, you can move all the RO pre-requisite checks in the cookbook's __init__ that is run before the START !log, so that the failure wi...
[11:05:16] <logmsgbot>	 !log volans@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM cloudcumin2001.codfw.wmnet - volans@cumin2002"
[11:05:32] <wikibugs>	 (03PS1) 10Hnowlan: thumbor: move replica definition to per-DC [deployment-charts] - 10https://gerrit.wikimedia.org/r/865595
[11:06:19] <logmsgbot>	 !log volans@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM cloudcumin2001.codfw.wmnet - volans@cumin2002"
[11:06:19] <logmsgbot>	 !log volans@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[11:06:19] <logmsgbot>	 !log volans@cumin2002 START - Cookbook sre.dns.wipe-cache cloudcumin2001.codfw.wmnet on all recursors
[11:06:22] <logmsgbot>	 !log volans@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcumin2001.codfw.wmnet on all recursors
[11:07:00] <wikibugs>	 (03PS3) 10Clément Goubert: P:mediawiki::php:monitoring: Longer opcache delay [puppet] - 10https://gerrit.wikimedia.org/r/865580 (https://phabricator.wikimedia.org/T324649)
[11:07:41] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/863381 (owner: 10Volans)
[11:08:40] <wikibugs>	 (03PS2) 10Volans: sre.hosts.reimage: call the Hiera cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/863381
[11:09:10] <wikibugs>	 (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (NOOP 1 DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38613/console" [puppet] - 10https://gerrit.wikimedia.org/r/865580 (https://phabricator.wikimedia.org/T324649) (owner: 10Clément Goubert)
[11:11:37] <logmsgbot>	 !log volans@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host cloudcumin2001.codfw.wmnet
[11:11:55] <wikibugs>	 (03CR) 10Klausman: [C: 03+1] Upgrade knative to 1.7.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/861395 (https://phabricator.wikimedia.org/T323793) (owner: 10Elukey)
[11:12:31] <wikibugs>	 (03PS4) 10Volans: cloudcumin: setup the 2 new VMs [puppet] - 10https://gerrit.wikimedia.org/r/865116 (https://phabricator.wikimedia.org/T319401)
[11:12:54] <jinxer-wm>	 (NodeTextfileStale) firing: (48) Stale textfile for cp1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[11:13:37] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10vm-requests: codfw: (1) VM request for cumincloud2001 - https://phabricator.wikimedia.org/T323518 (10Volans) VM created with: ` sudo cookbook sre.ganeti.makevm --cluster codfw --group C cloudcumin2001 `
[11:13:57] <wikibugs>	 (03CR) 10Volans: [C: 03+2] sre.hosts.reimage: call the Hiera cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/863381 (owner: 10Volans)
[11:14:20] <wikibugs>	 10Puppet, 10SRE, 10Infrastructure-Foundations, 10serviceops-collab, 10serviceops-radar: Fix UIDs for deployment server users - https://phabricator.wikimedia.org/T163667 (10jbond) >mwdeploy has uid/gid 499 in prod hosts Just wanted to note that this is not quite the case.  On most hosts the uid is 499 how...
[11:15:34] <wikibugs>	 (03Merged) 10jenkins-bot: sre.hosts.reimage: call the Hiera cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/863381 (owner: 10Volans)
[11:18:44] <wikibugs>	 (03PS3) 10Ilias Sarantopoulos: ml-services: increase limitrange for containers/pods in ml-staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/865589 (https://phabricator.wikimedia.org/T323624)
[11:19:42] <wikibugs>	 (03PS4) 10Ilias Sarantopoulos: ml-services: increase limitrange for containers/pods in ml-staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/865589 (https://phabricator.wikimedia.org/T323624)
[11:19:58] <wikibugs>	 (03CR) 10Volans: [C: 03+2] cloudcumin: setup the 2 new VMs [puppet] - 10https://gerrit.wikimedia.org/r/865116 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans)
[11:22:05] <wikibugs>	 (03CR) 10Jbond: [C: 04-1] "lgtm apart from minor typo, -1 is for the missing $" [puppet] - 10https://gerrit.wikimedia.org/r/864729 (owner: 10Volans)
[11:24:08] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] spicerack: add module injection support [software/spicerack] - 10https://gerrit.wikimedia.org/r/863004 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans)
[11:24:49] <wikibugs>	 (03CR) 10Volans: [C: 03+2] spicerack: add module injection support [software/spicerack] - 10https://gerrit.wikimedia.org/r/863004 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans)
[11:25:03] <icinga-wm>	 PROBLEM - Router interfaces on cr2-drmrs is CRITICAL: CRITICAL: host 185.15.58.129, interfaces up: 57, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[11:26:26] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] ml-services: increase limitrange for containers/pods in ml-staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/865589 (https://phabricator.wikimedia.org/T323624) (owner: 10Ilias Sarantopoulos)
[11:28:49] <wikibugs>	 (03Merged) 10jenkins-bot: spicerack: add module injection support [software/spicerack] - 10https://gerrit.wikimedia.org/r/863004 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans)
[11:29:56] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[11:30:28] <wikibugs>	 (03PS1) 10Volans: cloud-cumin: set the installer to use bullseye [puppet] - 10https://gerrit.wikimedia.org/r/865600 (https://phabricator.wikimedia.org/T319401)
[11:30:51] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[11:31:38] <wikibugs>	 (03CR) 10Volans: [C: 03+2] "self-merging, wrong OS" [puppet] - 10https://gerrit.wikimedia.org/r/865600 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans)
[11:31:47] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/865061 (owner: 10Muehlenhoff)
[11:33:09] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
[11:33:50] <wikibugs>	 (03PS3) 10Muehlenhoff: package_builder: Don't fail on cleanup jobs [puppet] - 10https://gerrit.wikimedia.org/r/865061
[11:35:22] <wikibugs>	 (03PS2) 10Hnowlan: thumbor: enable mesh, move replicas to main values [deployment-charts] - 10https://gerrit.wikimedia.org/r/865595
[11:36:10] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] package_builder: Don't fail on cleanup jobs [puppet] - 10https://gerrit.wikimedia.org/r/865061 (owner: 10Muehlenhoff)
[11:37:15] <icinga-wm>	 RECOVERY - Check systemd state on build2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:38:01] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] superset: Enable profile::auto_restarts::service for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/862883 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff)
[11:39:25] <wikibugs>	 10SRE, 10Cloud-Services, 10observability, 10Sustainability (Incident Followup), and 2 others: Switch rsyslog from gtls to ossl - https://phabricator.wikimedia.org/T324623 (10MoritzMuehlenhoff) p:05Triage→03Medium
[11:40:58] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
[11:41:46] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "I can't meaningfully say on whether this will be an improvement over the (AFAICT) unattended warning, but open to try!" [puppet] - 10https://gerrit.wikimedia.org/r/865580 (https://phabricator.wikimedia.org/T324649) (owner: 10Clément Goubert)
[11:42:50] <wikibugs>	 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (10ssingh) >>! In T322048#8449909, @ayounsi wrote: > FYI, there are outstanding Homer diffs for asw1-eqsin: > `lang=diff > [edit interfaces] > -   ge-0/0/16 { > -       description DISAB...
[11:42:56] <wikibugs>	 (03PS1) 10Muehlenhoff: Add component/rsyslog-openssl for Buster [puppet] - 10https://gerrit.wikimedia.org/r/865602 (https://phabricator.wikimedia.org/T324623)
[11:45:45] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: [WiP] Add base.volume module [deployment-charts] - 10https://gerrit.wikimedia.org/r/865603
[11:46:38] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] Add component/rsyslog-openssl for Buster [puppet] - 10https://gerrit.wikimedia.org/r/865602 (https://phabricator.wikimedia.org/T324623) (owner: 10Muehlenhoff)
[11:48:06] <wikibugs>	 (03PS1) 10Ssingh: ntp/eqsin: move to dns5004 [dns] - 10https://gerrit.wikimedia.org/r/865605 (https://phabricator.wikimedia.org/T323830)
[11:50:57] <logmsgbot>	 !log hashar@deploy1002 Started deploy [integration/docroot@2e0d44b]: Spelling, coobooks -> cookbooks
[11:51:11] <logmsgbot>	 !log hashar@deploy1002 Finished deploy [integration/docroot@2e0d44b]: Spelling, coobooks -> cookbooks (duration: 00m 14s)
[11:51:46] <wikibugs>	 (03PS1) 10Volans: cloud_management: fix missing key in hiera [puppet] - 10https://gerrit.wikimedia.org/r/865608
[11:52:33] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Add component/rsyslog-openssl for Buster [puppet] - 10https://gerrit.wikimedia.org/r/865602 (https://phabricator.wikimedia.org/T324623) (owner: 10Muehlenhoff)
[11:54:31] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] ntp/eqsin: move to dns5004 [dns] - 10https://gerrit.wikimedia.org/r/865605 (https://phabricator.wikimedia.org/T323830) (owner: 10Ssingh)
[11:54:48] <logmsgbot>	 !log oblivian@deploy1002 helmfile [staging] START helmfile.d/services/shellbox-constraints: apply
[11:55:03] <logmsgbot>	 !log oblivian@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply
[11:55:15] <logmsgbot>	 !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply
[11:55:27] <sukhe>	 !log running authdns-update for Gerrit: 865605
[11:55:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:55:42] <logmsgbot>	 !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply
[11:56:02] <wikibugs>	 (03PS1) 10Ssingh: hiera: decommission dns5002 [puppet] - 10https://gerrit.wikimedia.org/r/865610 (https://phabricator.wikimedia.org/T323830)
[11:56:20] <wikibugs>	 (03PS1) 10Ssingh: sites.yaml: remove dns5002 from anycast_neighbors [homer/public] - 10https://gerrit.wikimedia.org/r/865611 (https://phabricator.wikimedia.org/T323830)
[11:57:07] <wikibugs>	 (03CR) 10Volans: [C: 03+2] "fix puppet" [puppet] - 10https://gerrit.wikimedia.org/r/865608 (owner: 10Volans)
[11:57:51] <moritzm>	 !log imported librelp 1.10.0-1~buster1 to component/rsyslog-openssl T324623
[11:57:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:57:56] <stashbot>	 T324623: Switch rsyslog from gtls to ossl - https://phabricator.wikimedia.org/T324623
[11:59:24] <wikibugs>	 10SRE-swift-storage: Swiftrepl doesn't work on bullseye (and swiftrepl.conf is deployed by hand) - https://phabricator.wikimedia.org/T299125 (10MatthewVernon) I've a package of rclone 1.60.1 that builds cleanly against unstable now; I'll be uploading it soon (tomorrow unless anyone on the go team objects).
[11:59:28] <moritzm>	 !log imported rsyslog 8.2102.0-2+deb11u1~buster1 to component/rsyslog-openssl T324623
[11:59:30] <logmsgbot>	 !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply
[11:59:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:59:57] <logmsgbot>	 !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply
[12:02:34] <wikibugs>	 10SRE-swift-storage: Update Debian rclone package to 1.60.0 - https://phabricator.wikimedia.org/T322547 (10MatthewVernon) I have a package that's pretty much ready to go, hopefully upload tomorrow.
[12:03:19] <wikibugs>	 (03PS1) 10Ssingh: lvs5005: commission new LVS host (eqsin hardware refresh) [puppet] - 10https://gerrit.wikimedia.org/r/865613 (https://phabricator.wikimedia.org/T322048)
[12:04:40] <wikibugs>	 (03PS1) 10Volans: cloud_management: add missing wikimedia_clusters [puppet] - 10https://gerrit.wikimedia.org/r/865614 (https://phabricator.wikimedia.org/T319401)
[12:05:22] <wikibugs>	 (03PS1) 10Ssingh: sites.yaml: add lvs5005 (eqsin hardware refresh) [homer/public] - 10https://gerrit.wikimedia.org/r/865615 (https://phabricator.wikimedia.org/T322048)
[12:05:40] <wikibugs>	 10SRE, 10Cloud-Services, 10observability, 10Patch-For-Review, and 3 others: Switch rsyslog from gtls to ossl - https://phabricator.wikimedia.org/T324623 (10MoritzMuehlenhoff) This turned out to be a little more complicated than initially assumed. I've now built a backport of the version that is in Bullseye (w...
[12:07:18] <wikibugs>	 (03CR) 10Jbond: "-1 is for the incorrect param name in the doc string, but see other comments. I have also added kieth as observability own this infrastruct" [puppet] - 10https://gerrit.wikimedia.org/r/865184 (https://phabricator.wikimedia.org/T127717) (owner: 10Andrew Bogott)
[12:07:31] <wikibugs>	 (03CR) 10Volans: [C: 03+2] "fix puppet" [puppet] - 10https://gerrit.wikimedia.org/r/865614 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans)
[12:09:32] <logmsgbot>	 !log oblivian@deploy1002 helmfile [staging] START helmfile.d/services/shellbox-media: apply
[12:09:55] <logmsgbot>	 !log oblivian@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox-media: apply
[12:10:08] <logmsgbot>	 !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/shellbox-media: apply
[12:10:40] <logmsgbot>	 !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/shellbox-media: apply
[12:10:52] <logmsgbot>	 !log volans@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on cloudcumin2001.codfw.wmnet with reason: First installation
[12:11:02] <logmsgbot>	 !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox-media: apply
[12:11:31] <logmsgbot>	 !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox-media: apply
[12:12:31] <logmsgbot>	 !log oblivian@deploy1002 helmfile [staging] START helmfile.d/services/shellbox-syntaxhighlight: apply
[12:12:50] <logmsgbot>	 !log oblivian@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox-syntaxhighlight: apply
[12:13:40] <logmsgbot>	 !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox-syntaxhighlight: apply
[12:14:09] <logmsgbot>	 !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox-syntaxhighlight: apply
[12:14:34] <logmsgbot>	 !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/shellbox-syntaxhighlight: apply
[12:14:53] <logmsgbot>	 !log volans@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 4:00:00 on cloudcumin2001.codfw.wmnet with reason: First installation
[12:15:08] <logmsgbot>	 !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/shellbox-syntaxhighlight: apply
[12:15:21] <logmsgbot>	 !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/shellbox-syntaxhighlight: apply
[12:15:24] <logmsgbot>	 !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/shellbox-syntaxhighlight: apply
[12:15:58] <logmsgbot>	 !log oblivian@deploy1002 helmfile [staging] START helmfile.d/services/shellbox-timeline: apply
[12:16:25] <logmsgbot>	 !log oblivian@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox-timeline: apply
[12:16:35] <logmsgbot>	 !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox-timeline: apply
[12:17:22] <logmsgbot>	 !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox-timeline: apply
[12:18:21] <logmsgbot>	 !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/shellbox-timeline: apply
[12:18:52] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] superset: Enable profile::auto_restarts::service for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/862883 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff)
[12:19:15] <logmsgbot>	 !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/shellbox-timeline: apply
[12:22:32] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] hue: Enable profile::auto_restarts::service for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/862885 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff)
[12:24:36] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] yarn: Enable profile::auto_restarts::service for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/862886 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff)
[12:25:28] <wikibugs>	 (03PS2) 10Muehlenhoff: Enable profile::auto_restarts::service for Superset [puppet] - 10https://gerrit.wikimedia.org/r/862933 (https://phabricator.wikimedia.org/T135991)
[12:27:33] <jinxer-wm>	 (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on cloudcumin2001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[12:27:52] <volans>	 that's me, WIP
[12:28:47] <logmsgbot>	 !log volans@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on cloudcumin2001.codfw.wmnet with reason: First installation
[12:28:48] <logmsgbot>	 !log volans@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 4:00:00 on cloudcumin2001.codfw.wmnet with reason: First installation
[12:29:14] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Enable profile::auto_restarts::service for Superset [puppet] - 10https://gerrit.wikimedia.org/r/862933 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff)
[12:30:31] <moritzm>	 !log upgrading cloudweb to PHP 7.4.33
[12:30:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:32:33] <jinxer-wm>	 (KeyholderUnarmed) resolved: 1 unarmed Keyholder key(s) on cloudcumin2001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[12:32:45] <logmsgbot>	 !log volans@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on cloudcumin2001.codfw.wmnet with reason: First installation
[12:32:59] <logmsgbot>	 !log volans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on cloudcumin2001.codfw.wmnet with reason: First installation
[12:33:45] <moritzm>	 !log upgrading deployment servers to PHP 7.4.33
[12:33:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:35:12] <wikibugs>	 (03PS1) 10Volans: cloud_management: re-add datacenter-ops [puppet] - 10https://gerrit.wikimedia.org/r/865617 (https://phabricator.wikimedia.org/T319401)
[12:36:14] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/865617 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans)
[12:37:24] <moritzm>	 !log upgrading mwmaint servers to PHP 7.4.33
[12:37:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:38:57] <wikibugs>	 (03CR) 10Jbond: "some minor nits; however, it's probably worth touching base with o11y, as I have also seen some TLS-related changes relating to rsyslog from the" [puppet] - 10https://gerrit.wikimedia.org/r/865174 (https://phabricator.wikimedia.org/T127717) (owner: 10Andrew Bogott)
[12:40:58] <wikibugs>	 (03CR) 10Volans: [C: 03+2] cloud_management: re-add datacenter-ops [puppet] - 10https://gerrit.wikimedia.org/r/865617 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans)
[12:42:42] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Integrate Buster 10.13 point update - https://phabricator.wikimedia.org/T317413 (10MoritzMuehlenhoff)
[12:48:23] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] analytics::refinery: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/858604 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff)
[12:48:25] <wikibugs>	 (03PS2) 10JMeybohm: pki: Add intermediates for wikikube and wikikube staging [puppet] - 10https://gerrit.wikimedia.org/r/865591
[12:48:27] <wikibugs>	 (03PS2) 10JMeybohm: k8s: Add support for PKI with k8s >= 1.23 [puppet] - 10https://gerrit.wikimedia.org/r/865592 (https://phabricator.wikimedia.org/T307943)
[12:48:29] <wikibugs>	 (03PS1) 10JMeybohm: kubeadm: Declare /etc/kubernetes directory resource directly [puppet] - 10https://gerrit.wikimedia.org/r/865619
[12:49:29] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] sites.yaml: add lvs5005 (eqsin hardware refresh) [homer/public] - 10https://gerrit.wikimedia.org/r/865615 (https://phabricator.wikimedia.org/T322048) (owner: 10Ssingh)
[12:51:25] <wikibugs>	 (03CR) 10JMeybohm: "I'm going to (re)move that class in a follow up patch having it create another directory and hopefully it will be going away after the mig" [puppet] - 10https://gerrit.wikimedia.org/r/865619 (owner: 10JMeybohm)
[12:51:56] <wikibugs>	 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (10ayounsi) >>! In T322048#8450256, @ssingh wrote: >>>! In T322048#8449909, @ayounsi wrote: >> FYI, there are outstanding Homer diffs for asw1-eqsin: >> `lang=diff...
[12:52:02] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] sites.yaml: remove dns5002 from anycast_neighbors [homer/public] - 10https://gerrit.wikimedia.org/r/865611 (https://phabricator.wikimedia.org/T323830) (owner: 10Ssingh)
[12:59:09] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: Spicerack: don't IRC log start/stop of cookbook - https://phabricator.wikimedia.org/T324655 (10ayounsi)     >>! In T324655#8450198, @Volans wrote: > For the latter part, you can move all the RO pre-requisite checks in the cookbook's __init__ that is run...
[13:09:31] <logmsgbot>	 !log volans@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on cloudcumin1001.eqiad.wmnet with reason: First installation
[13:09:44] <logmsgbot>	 !log volans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on cloudcumin1001.eqiad.wmnet with reason: First installation
[13:10:16] <wikibugs>	 (03PS1) 10Clément Goubert: P:docker::builder: Add otelcol-contrib uid mapping [puppet] - 10https://gerrit.wikimedia.org/r/865623
[13:11:59] <wikibugs>	 (03CR) 10Clément Goubert: Add a new production image for otelcol (032 comments) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/857672 (https://phabricator.wikimedia.org/T320552) (owner: 10Clément Goubert)
[13:15:59] <wikibugs>	 (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38616/console" [puppet] - 10https://gerrit.wikimedia.org/r/865623 (owner: 10Clément Goubert)
[13:16:24] <wikibugs>	 (03CR) 10Clément Goubert: P:docker::builder: Add otelcol-contrib uid mapping [puppet] - 10https://gerrit.wikimedia.org/r/865623 (owner: 10Clément Goubert)
[13:18:58] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1206', diff saved to https://phabricator.wikimedia.org/P42443 and previous config saved to /var/cache/conftool/dbconfig/20221207-131858-marostegui.json
[13:19:17] <wikibugs>	 (03PS1) 10Marostegui: Revert "db1206: Enable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/865517
[13:20:24] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] Revert "db1206: Enable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/865517 (owner: 10Marostegui)
[13:22:51] <logmsgbot>	 !log volans@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Created cloudcumin instances - volans@cumin1001"
[13:25:56] <logmsgbot>	 !log volans@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Created cloudcumin instances - volans@cumin1001"
[13:33:37] <claime>	 jouncebot: nowandnext
[13:33:37] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 26 minute(s)
[13:33:37] <jouncebot>	 In 0 hour(s) and 26 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221207T1400)
[13:34:45] <logmsgbot>	 !log mwpresync@deploy1002 Started scap: testwikis wikis to 1.40.0-wmf.13  refs T320518
[13:34:49] <stashbot>	 T320518: 1.40.0-wmf.13 deployment blockers - https://phabricator.wikimedia.org/T320518
[13:35:08] <icinga-wm>	 RECOVERY - Router interfaces on cr2-drmrs is OK: OK: host 185.15.58.129, interfaces up: 61, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[13:38:47] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove misc-apache Cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/865625
[13:40:01] <wikibugs>	 (03PS4) 10Ottomata: flink-kubernetes-operator - modify for WMF [deployment-charts] - 10https://gerrit.wikimedia.org/r/865158 (https://phabricator.wikimedia.org/T324576)
[13:41:11] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Remove misc-apache Cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/865625 (owner: 10Muehlenhoff)
[13:42:30] <logmsgbot>	 !log mwpresync@deploy1002 Finished scap: testwikis wikis to 1.40.0-wmf.13  refs T320518 (duration: 07m 45s)
[13:42:34] <stashbot>	 T320518: 1.40.0-wmf.13 deployment blockers - https://phabricator.wikimedia.org/T320518
[13:48:00] <wikibugs>	 (03PS1) 10Muehlenhoff: doc: Enable profile::auto_restarts::service for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/865646 (https://phabricator.wikimedia.org/T135991)
[13:49:38] <wikibugs>	 (03PS4) 10Clément Goubert: P:mediawiki::php:monitoring: Longer opcache delay [puppet] - 10https://gerrit.wikimedia.org/r/865580 (https://phabricator.wikimedia.org/T324649)
[13:54:56] <wikibugs>	 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic: ganeti500[567] implementation tracking for serviceops - https://phabricator.wikimedia.org/T324610 (10MoritzMuehlenhoff) Ack, decomming these by mid January sounds doable!
[13:55:28] <wikibugs>	 (03PS3) 10JMeybohm: k8s: Add support for PKI with k8s >= 1.23 [puppet] - 10https://gerrit.wikimedia.org/r/865592 (https://phabricator.wikimedia.org/T307943)
[13:59:33] <wikibugs>	 (03PS1) 10Muehlenhoff: prometheus: Enable profile::auto_restarts::service for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/865648 (https://phabricator.wikimedia.org/T135991)
[14:00:04] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, and awight: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221207T1400).
[14:00:04] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[14:01:13] <Lucas_WMDE>	 o/
[14:01:24] <Lucas_WMDE>	 yup, looks like nothing to do
[14:02:53] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on an-tool1005.eqiad.wmnet with reason: redeploying an-tool1005 as bullseye
[14:03:08] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on an-tool1005.eqiad.wmnet with reason: redeploying an-tool1005 as bullseye
[14:05:55] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: Spicerack: don't IRC log start/stop of cookbook - https://phabricator.wikimedia.org/T324655 (10Volans) >>! In T324655#8450468, @ayounsi wrote: >  >  >  >  >>>! In T324655#8450198, @Volans wrote: >> For the latter part, you can move all the RO pre-requis...
[14:11:18] <wikibugs>	 (03CR) 10JMeybohm: "I would also argue not to remove things from the chart that can just stay disabled/unused to allow for easier merging of upstream changes " [deployment-charts] - 10https://gerrit.wikimedia.org/r/865158 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata)
[14:12:41] <wikibugs>	 (03PS1) 10Hashar: hiera: reorder contint1001 entries [puppet] - 10https://gerrit.wikimedia.org/r/865649
[14:14:00] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] sites.yaml: remove dns5002 from anycast_neighbors [homer/public] - 10https://gerrit.wikimedia.org/r/865611 (https://phabricator.wikimedia.org/T323830) (owner: 10Ssingh)
[14:14:43] <wikibugs>	 (03Merged) 10jenkins-bot: sites.yaml: remove dns5002 from anycast_neighbors [homer/public] - 10https://gerrit.wikimedia.org/r/865611 (https://phabricator.wikimedia.org/T323830) (owner: 10Ssingh)
[14:16:49] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dns5002.wikimedia.org with reason: downtimed, to be depooled
[14:17:04] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dns5002.wikimedia.org with reason: downtimed, to be depooled
[14:18:07] <icinga-wm>	 RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 23 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[14:18:08] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] hiera: decommission dns5002 [puppet] - 10https://gerrit.wikimedia.org/r/865610 (https://phabricator.wikimedia.org/T323830) (owner: 10Ssingh)
[14:18:23] <wikibugs>	 (03CR) 10JMeybohm: flink-kubernetes-operator - modify for WMF (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/865158 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata)
[14:18:29] <wikibugs>	 (03PS2) 10Ssingh: hiera: decommission dns5002 [puppet] - 10https://gerrit.wikimedia.org/r/865610 (https://phabricator.wikimedia.org/T323830)
[14:20:13] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.decommission for hosts dns5002.wikimedia.org
[14:22:00] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job trafficserver-text in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:24:36] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.dns.netbox
[14:25:01] <wikibugs>	 (03PS1) 10Ayounsi: OSPF: update drmrs GTT interface name [homer/public] - 10https://gerrit.wikimedia.org/r/865652 (https://phabricator.wikimedia.org/T324047)
[14:26:28] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] OSPF: update drmrs GTT interface name [homer/public] - 10https://gerrit.wikimedia.org/r/865652 (https://phabricator.wikimedia.org/T324047) (owner: 10Ayounsi)
[14:26:44] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: dns5002.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin2002"
[14:26:45] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job haproxy in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:26:53] <wikibugs>	 (03PS9) 10Awight: kartotherian: add kartotherian chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/531699 (https://phabricator.wikimedia.org/T231006) (owner: 10Mathew.onipe)
[14:27:00] <wikibugs>	 (03Merged) 10jenkins-bot: OSPF: update drmrs GTT interface name [homer/public] - 10https://gerrit.wikimedia.org/r/865652 (https://phabricator.wikimedia.org/T324047) (owner: 10Ayounsi)
[14:27:41] <moritzm>	 !log restarting ntpd
[14:27:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:27:43] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] kartotherian: add kartotherian chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/531699 (https://phabricator.wikimedia.org/T231006) (owner: 10Mathew.onipe)
[14:27:56] <wikibugs>	 (03PS1) 10Ssingh: dns5003: add Puppet role and DNS/NTP configs [puppet] - 10https://gerrit.wikimedia.org/r/865657 (https://phabricator.wikimedia.org/T322048)
[14:28:10] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: dns5002.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin2002"
[14:28:10] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:28:11] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts dns5002.wikimedia.org
[14:28:18] <wikibugs>	 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q2:rack/setup/install/decom eqsin: unified decommission task - https://phabricator.wikimedia.org/T323830 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by sukhe@cumin2002 for hosts: `dns5002.wikimedia.org` - dns5002.wikimedia....
[14:28:58] <wikibugs>	 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q2:rack/setup/install/decom eqsin: unified decommission task - https://phabricator.wikimedia.org/T323830 (10ssingh)
[14:29:47] <icinga-wm>	 RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[14:29:55] <icinga-wm>	 RECOVERY - BFD status on cr2-eqdfw is OK: OK: UP: 13 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[14:30:15] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] "LGTM, but I'll defer to John the final green light!" [puppet] - 10https://gerrit.wikimedia.org/r/865075 (owner: 10JMeybohm)
[14:31:45] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job haproxy in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:31:59] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] dns5003: add Puppet role and DNS/NTP configs [puppet] - 10https://gerrit.wikimedia.org/r/865657 (https://phabricator.wikimedia.org/T322048) (owner: 10Ssingh)
[14:32:33] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[14:32:56] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host dns5003.wikimedia.org with OS buster
[14:33:06] <wikibugs>	 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host dns5003.wikimedia.org with OS buster
[14:33:14] <wikibugs>	 (03CR) 10Elukey: "Do we have a pcc run to see the diffs?" [puppet] - 10https://gerrit.wikimedia.org/r/865591 (owner: 10JMeybohm)
[14:35:00] <wikibugs>	 (03PS1) 10Ssingh: sites.yaml: add dns5003 (eqsin hardware refresh) [homer/public] - 10https://gerrit.wikimedia.org/r/865660 (https://phabricator.wikimedia.org/T322048)
[14:38:26] <XioNoX>	 !log draining Arelion eqiad-codfw circuit for optic replacement
[14:38:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:39:19] <wikibugs>	 (03PS2) 10Ssingh: lvs5005: commission new LVS host (eqsin hardware refresh) [puppet] - 10https://gerrit.wikimedia.org/r/865613 (https://phabricator.wikimedia.org/T322048)
[14:41:45] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job haproxy in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:42:55] <wikibugs>	 (03CR) 10Herron: [C: 03+1] prometheus: Enable profile::auto_restarts::service for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/865648 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff)
[14:43:03] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] lvs5005: commission new LVS host (eqsin hardware refresh) [puppet] - 10https://gerrit.wikimedia.org/r/865613 (https://phabricator.wikimedia.org/T322048) (owner: 10Ssingh)
[14:44:46] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host lvs5005.eqsin.wmnet with OS buster
[14:44:57] <wikibugs>	 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host lvs5005.eqsin.wmnet with OS buster
[14:46:01] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[14:47:48] <wikibugs>	 (03PS5) 10Eevans: Promote Cassandra 3.11.13 to '3.x' (aka stable) [puppet] - 10https://gerrit.wikimedia.org/r/863026
[14:49:05] <wikibugs>	 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: eqiad: Move links to new MPC7E linecard - https://phabricator.wikimedia.org/T304712 (10ayounsi)
[14:49:24] <wikibugs>	 (03CR) 10Eevans: [C: 03+2] Promote Cassandra 3.11.13 to '3.x' (aka stable) [puppet] - 10https://gerrit.wikimedia.org/r/863026 (owner: 10Eevans)
[14:50:25] <wikibugs>	 (03CR) 10Eevans: [V: 03+2 C: 03+2] Promote Cassandra 3.11.13 to '3.x' (aka stable) [puppet] - 10https://gerrit.wikimedia.org/r/863026 (owner: 10Eevans)
[14:52:21] <wikibugs>	 (03PS1) 10Btullis: Upgrade an-tool1005 from buster to bullseye [puppet] - 10https://gerrit.wikimedia.org/r/865669 (https://phabricator.wikimedia.org/T323458)
[14:55:43] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus: Enable profile::auto_restarts::service for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/865648 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff)
[14:56:14] <logmsgbot>	 !log krinkle@deploy1002 Started deploy [performance/navtiming@6caa033]: (no justification provided)
[14:56:22] <logmsgbot>	 !log krinkle@deploy1002 Finished deploy [performance/navtiming@6caa033]: (no justification provided) (duration: 00m 07s)
[14:56:59] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST nodes) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:01:53] <wikibugs>	 (03PS1) 10Btullis: Update the mediawiki_history_snapshot in use by AQS [puppet] - 10https://gerrit.wikimedia.org/r/865671
[15:01:57] <icinga-wm>	 PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10
[15:01:59] <jinxer-wm>	 (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (LIST nodes) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:03:04] <wikibugs>	 (03CR) 10Milimetric: [C: 03+1] Update the mediawiki_history_snapshot in use by AQS [puppet] - 10https://gerrit.wikimedia.org/r/865671 (owner: 10Btullis)
[15:03:36] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on dns5003.wikimedia.org with reason: host reimage
[15:04:19] <wikibugs>	 10SRE-swift-storage, 10Community-Tech, 10MediaWiki-extensions-Phonos, 10MW-1.40-notes (1.40.0-wmf.6; 2022-10-17): Establish Phonos production storage requirements - https://phabricator.wikimedia.org/T320675 (10Eevans) >>! In T320675#8449574, @dmaza wrote: >>>! In T320675#8368902, @Eevans wrote: >> TL;DR I...
[15:06:31] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Upgrade an-tool1005 from buster to bullseye [puppet] - 10https://gerrit.wikimedia.org/r/865669 (https://phabricator.wikimedia.org/T323458) (owner: 10Btullis)
[15:06:46] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dns5003.wikimedia.org with reason: host reimage
[15:07:14] <jinxer-wm>	 (KubernetesAPILatency) firing: (3) High Kubernetes API latency (PATCH events) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:07:30] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Update the mediawiki_history_snapshot in use by AQS [puppet] - 10https://gerrit.wikimedia.org/r/865671 (owner: 10Btullis)
[15:08:11] <icinga-wm>	 PROBLEM - jenkins_service_running on releases1002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args .*/bin/java .*-jar /usr/share/java/jenkins.war https://wikitech.wikimedia.org/wiki/Jenkins
[15:09:11] <icinga-wm>	 RECOVERY - jenkins_service_running on releases1002 is OK: PROCS OK: 1 process with regex args .*/bin/java .*-jar /usr/share/java/jenkins.war https://wikitech.wikimedia.org/wiki/Jenkins
[15:09:42] <hashar>	 releases1002 alarmed cause I was restarting Jenkins there
[15:10:05] <claime>	 ack
[15:11:12] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.aqs.roll-restart for AQS aqs cluster: Roll restart of all AQS's nodejs daemons.
[15:11:21] <icinga-wm>	 PROBLEM - Recursive DNS on 103.102.166.10 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS
[15:12:14] <jinxer-wm>	 (KubernetesAPILatency) firing: (5) High Kubernetes API latency (PATCH events) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:12:24] <sukhe>	 ^ expected
[15:12:43] <claime>	 ack thanks
[15:12:55] <jinxer-wm>	 (NodeTextfileStale) firing: (48) Stale textfile for cp1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[15:13:29] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs5005.eqsin.wmnet with reason: host reimage
[15:14:03] <wikibugs>	 (03CR) 10David Caro: "LGTM, let me try to test it in toolsbeta" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/857588 (https://phabricator.wikimedia.org/T293645) (owner: 10Raymond Ndibe)
[15:16:31] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs5005.eqsin.wmnet with reason: host reimage
[15:16:45] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job haproxy in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:17:14] <jinxer-wm>	 (KubernetesAPILatency) resolved: (5) High Kubernetes API latency (PATCH events) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:17:48] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] search: drop search-drop-query-clicks systemd timer (2/2) [puppet] - 10https://gerrit.wikimedia.org/r/865073 (owner: 10DCausse)
[15:17:49] <icinga-wm>	 PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:18:15] <wikibugs>	 (03PS1) 10Hashar: contint: give access to contint1002 [puppet] - 10https://gerrit.wikimedia.org/r/865672 (https://phabricator.wikimedia.org/T313832)
[15:19:13] <icinga-wm>	 PROBLEM - Recursive DNS on 2001:df2:e500:1:103:102:166:10 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS
[15:19:32] <sukhe>	 ^ expected, should resolve soon
[15:19:48] <claime>	 ack thanks
[15:20:21] <wikibugs>	 (03PS1) 10Muehlenhoff: Enable profile::auto_restarts::service for Apache on VRTS [puppet] - 10https://gerrit.wikimedia.org/r/865674 (https://phabricator.wikimedia.org/T135991)
[15:23:01] <icinga-wm>	 RECOVERY - Recursive DNS on 103.102.166.10 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS
[15:23:51] <icinga-wm>	 RECOVERY - Recursive DNS on 2001:df2:e500:1:103:102:166:10 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS
[15:24:35] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.aqs.roll-restart (exit_code=0) for AQS aqs cluster: Roll restart of all AQS's nodejs daemons.
[15:26:06] <wikibugs>	 10SRE, 10Cloud-Services, 10observability, 10Patch-For-Review, and 3 others: Switch rsyslog from gtls to ossl - https://phabricator.wikimedia.org/T324623 (10Andrew) Wow, instant gratification! Thank you @MoritzMuehlenhoff, I will test.
[15:26:45] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job haproxy in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:31:31] <icinga-wm>	 RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10
[15:33:37] <wikibugs>	 (03PS6) 10Effie Mouzeli: ProductionServices: Use redis_misc servers for LockManager (4/6) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865121 (https://phabricator.wikimedia.org/T267581)
[15:34:27] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - sukhe@cumin2002"
[15:36:14] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - sukhe@cumin2002"
[15:36:14] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dns5003.wikimedia.org with OS buster
[15:36:27] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jiji@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865121 (https://phabricator.wikimedia.org/T267581) (owner: 10Effie Mouzeli)
[15:36:31] <wikibugs>	 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host dns5003.wikimedia.org with OS buster completed: - dns5003 (**PASS**)...
[15:37:32] <wikibugs>	 (03Merged) 10jenkins-bot: ProductionServices: Use redis_misc servers for LockManager (4/6) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865121 (https://phabricator.wikimedia.org/T267581) (owner: 10Effie Mouzeli)
[15:37:55] <logmsgbot>	 !log jiji@deploy1002 Started scap: Backport for [[gerrit:865121|ProductionServices: Use redis_misc servers for LockManager (4/6) (T267581)]]
[15:37:58] <stashbot>	 T267581: Phase out "redis_sessions" cluster and away from memcached cluster - https://phabricator.wikimedia.org/T267581
[15:37:59] <icinga-wm>	 RECOVERY - Check systemd state on maps1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:39:49] <logmsgbot>	 !log jiji@deploy1002 jiji and jiji: Backport for [[gerrit:865121|ProductionServices: Use redis_misc servers for LockManager (4/6) (T267581)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet
[15:40:04] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - sukhe@cumin2002"
[15:41:03] <icinga-wm>	 RECOVERY - Check systemd state on maps2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:41:25] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - sukhe@cumin2002"
[15:41:26] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs5005.eqsin.wmnet with OS buster
[15:41:34] <wikibugs>	 (03CR) 10David Caro: "The webservice starts as expected:" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/857588 (https://phabricator.wikimedia.org/T293645) (owner: 10Raymond Ndibe)
[15:41:38] <wikibugs>	 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host lvs5005.eqsin.wmnet with OS buster completed: - lvs5005 (**PASS**)...
[15:43:02] <wikibugs>	 (03PS1) 10Herron: update role_contacts for thanos (front|back)end [puppet] - 10https://gerrit.wikimedia.org/r/865679
[15:44:29] <wikibugs>	 (03PS2) 10Hashar: contint: give RelEng access to contint1002 [puppet] - 10https://gerrit.wikimedia.org/r/865672 (https://phabricator.wikimedia.org/T313832)
[15:44:31] <wikibugs>	 (03PS1) 10Hashar: contint: add ci::master to contint1002 [puppet] - 10https://gerrit.wikimedia.org/r/865680 (https://phabricator.wikimedia.org/T313832)
[15:44:33] <wikibugs>	 (03PS1) 10Hashar: contint: add contint1002 as a scap target [puppet] - 10https://gerrit.wikimedia.org/r/865681 (https://phabricator.wikimedia.org/T313832)
[15:44:58] <wikibugs>	 (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/865680 (https://phabricator.wikimedia.org/T313832) (owner: 10Hashar)
[15:45:41] <icinga-wm>	 RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:46:25] <logmsgbot>	 !log jiji@deploy1002 Finished scap: Backport for [[gerrit:865121|ProductionServices: Use redis_misc servers for LockManager (4/6) (T267581)]] (duration: 08m 29s)
[15:46:28] <stashbot>	 T267581: Phase out "redis_sessions" cluster and away from memcached cluster - https://phabricator.wikimedia.org/T267581
[15:48:09] <wikibugs>	 (03CR) 10David Caro: [C: 03+1] webservice cli: allow for deployment of custom harbor images (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/857588 (https://phabricator.wikimedia.org/T293645) (owner: 10Raymond Ndibe)
[15:48:11] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+1] "seems reasonable, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/865066 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff)
[15:48:53] <wikibugs>	 (03PS4) 10Effie Mouzeli: ProductionServices: Use redis_misc servers for LockManager (5/6) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865122 (https://phabricator.wikimedia.org/T267581)
[15:49:00] <wikibugs>	 (03PS2) 10Muehlenhoff: Enable profile::auto_restarts::service for Apache/FPM/Envoy on mwmaint/noc [puppet] - 10https://gerrit.wikimedia.org/r/865066 (https://phabricator.wikimedia.org/T135991)
[15:49:29] <wikibugs>	 (03PS3) 10Effie Mouzeli: ProductionServices: Use redis_misc servers for LockManager (6/6) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865123 (https://phabricator.wikimedia.org/T267581)
[15:50:38] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Enable profile::auto_restarts::service for Apache/FPM/Envoy on mwmaint/noc [puppet] - 10https://gerrit.wikimedia.org/r/865066 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff)
[15:50:47] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on logstash1026 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7fc2371af320: Failed to establish a new connection: [Errno 111] Connection refused)) https://wikitech.wi
[15:50:47] <icinga-wm>	 org/wiki/Search%23Administration
[15:50:47] <wikibugs>	 (03CR) 10Effie Mouzeli: Redis sessions: Goodbye (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/864830 (https://phabricator.wikimedia.org/T267581) (owner: 10Effie Mouzeli)
[15:51:04] <wikibugs>	 (03PS1) 10Hnowlan: api-gateway: add restbase routing, enable in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/865683 (https://phabricator.wikimedia.org/T322152)
[15:51:58] <wikibugs>	 (03CR) 10Andrew Bogott: "thanks for the review!" [puppet] - 10https://gerrit.wikimedia.org/r/865184 (https://phabricator.wikimedia.org/T127717) (owner: 10Andrew Bogott)
[15:52:09] <wikibugs>	 (03Abandoned) 10Muehlenhoff: puppet: migrate from require_package to ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/640688 (https://phabricator.wikimedia.org/T266479) (owner: 10Jbond)
[15:52:19] <wikibugs>	 (03PS4) 10JMeybohm: k8s: Add support for PKI with k8s >= 1.23 [puppet] - 10https://gerrit.wikimedia.org/r/865592 (https://phabricator.wikimedia.org/T307943)
[15:52:49] <wikibugs>	 (03CR) 10Majavah: [C: 04-1] webservice cli: allow for deployment of custom harbor images (034 comments) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/857588 (https://phabricator.wikimedia.org/T293645) (owner: 10Raymond Ndibe)
[15:56:21] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jiji@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865122 (https://phabricator.wikimedia.org/T267581) (owner: 10Effie Mouzeli)
[15:57:08] <wikibugs>	 (03Merged) 10jenkins-bot: ProductionServices: Use redis_misc servers for LockManager (5/6) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865122 (https://phabricator.wikimedia.org/T267581) (owner: 10Effie Mouzeli)
[15:57:32] <logmsgbot>	 !log jiji@deploy1002 Started scap: Backport for [[gerrit:865122|ProductionServices: Use redis_misc servers for LockManager (5/6) (T267581)]]
[15:57:36] <wikibugs>	 (03CR) 10David Caro: [C: 03+1] webservice cli: allow for deployment of custom harbor images (032 comments) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/857588 (https://phabricator.wikimedia.org/T293645) (owner: 10Raymond Ndibe)
[15:57:36] <stashbot>	 T267581: Phase out "redis_sessions" cluster and away from memcached cluster - https://phabricator.wikimedia.org/T267581
[15:57:41] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] update role_contacts for thanos (front|back)end [puppet] - 10https://gerrit.wikimedia.org/r/865679 (owner: 10Herron)
[15:58:15] <wikibugs>	 (03CR) 10Herron: [C: 03+2] update role_contacts for thanos (front|back)end [puppet] - 10https://gerrit.wikimedia.org/r/865679 (owner: 10Herron)
[15:58:46] <wikibugs>	 (03CR) 10Andrew Bogott: rsyslog: allow specifying a hiera-defined certfile (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/865174 (https://phabricator.wikimedia.org/T127717) (owner: 10Andrew Bogott)
[15:58:51] <wikibugs>	 (03CR) 10Majavah: [C: 04-1] webservice cli: allow for deployment of custom harbor images (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/857588 (https://phabricator.wikimedia.org/T293645) (owner: 10Raymond Ndibe)
[15:59:25] <logmsgbot>	 !log jiji@deploy1002 jiji and jiji: Backport for [[gerrit:865122|ProductionServices: Use redis_misc servers for LockManager (5/6) (T267581)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet
[15:59:39] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install kafka-jumbo101[0-5] - https://phabricator.wikimedia.org/T306939 (10BTullis) a:03BTullis Is it OK if I have a crack at this @papaul?
[15:59:43] <wikibugs>	 (03PS12) 10Andrew Bogott: rsyslog: allow specifying a hiera-defined certfile [puppet] - 10https://gerrit.wikimedia.org/r/865174 (https://phabricator.wikimedia.org/T127717)
[15:59:45] <wikibugs>	 (03PS4) 10Andrew Bogott: remote syslog: allow hiera config of rsyslog TLS CA [puppet] - 10https://gerrit.wikimedia.org/r/865184 (https://phabricator.wikimedia.org/T127717)
[16:00:55] <wikibugs>	 (03CR) 10Alexandros Kosiaris: Update cxserver to 2022-12-06-121330-production (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/865063 (https://phabricator.wikimedia.org/T321781) (owner: 10KartikMistry)
[16:02:13] <wikibugs>	 (03CR) 10David Caro: [C: 04-1] "The manifest update needs fixing as Taavi pointed out ;)" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/857588 (https://phabricator.wikimedia.org/T293645) (owner: 10Raymond Ndibe)
[16:02:28] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] remote syslog: allow hiera config of rsyslog TLS CA [puppet] - 10https://gerrit.wikimedia.org/r/865184 (https://phabricator.wikimedia.org/T127717) (owner: 10Andrew Bogott)
[16:03:48] <wikibugs>	 (03CR) 10Elukey: "Left a couple of nits, but overall it makes sense. Didn't get to review in detail all the changes in {master,node}.pp yet :(" [puppet] - 10https://gerrit.wikimedia.org/r/865592 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm)
[16:05:16] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] sites.yaml: add dns5003 (eqsin hardware refresh) [homer/public] - 10https://gerrit.wikimedia.org/r/865660 (https://phabricator.wikimedia.org/T322048) (owner: 10Ssingh)
[16:06:51] <sukhe>	 !log run homer in cr*-eqsin for Gerrit: 865660
[16:06:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:07:02] <wikibugs>	 (03CR) 10BBlack: [C: 03+2] eqsin cp: unify per-node hieradata [puppet] - 10https://gerrit.wikimedia.org/r/865120 (https://phabricator.wikimedia.org/T322048) (owner: 10BBlack)
[16:08:03] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+1] Redis sessions: Goodbye [puppet] - 10https://gerrit.wikimedia.org/r/864830 (https://phabricator.wikimedia.org/T267581) (owner: 10Effie Mouzeli)
[16:08:32] <logmsgbot>	 !log jiji@deploy1002 Finished scap: Backport for [[gerrit:865122|ProductionServices: Use redis_misc servers for LockManager (5/6) (T267581)]] (duration: 10m 59s)
[16:08:35] <stashbot>	 T267581: Phase out "redis_sessions" cluster and away from memcached cluster - https://phabricator.wikimedia.org/T267581
[16:09:59] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (PUT secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[16:10:46] <wikibugs>	 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (10ssingh)
[16:14:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (PUT secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[16:18:20] <wikibugs>	 (03PS1) 10Ssingh: lvs5002: set profile::pybal::bgp to no [puppet] - 10https://gerrit.wikimedia.org/r/865687 (https://phabricator.wikimedia.org/T323830)
[16:19:16] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38619/console" [puppet] - 10https://gerrit.wikimedia.org/r/865687 (https://phabricator.wikimedia.org/T323830) (owner: 10Ssingh)
[16:22:58] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] sites.yaml: add lvs5005 (eqsin hardware refresh) [homer/public] - 10https://gerrit.wikimedia.org/r/865615 (https://phabricator.wikimedia.org/T322048) (owner: 10Ssingh)
[16:24:35] <sukhe>	 !log run homer in cr*-eqsin for Gerrit: 865615
[16:24:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:25:13] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2117.codfw.wmnet with reason: Maintenance
[16:25:15] <aqu>	 !log Deploying analytics/refinery (HDFS usage scripts)
[16:25:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:25:23] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance
[16:25:26] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2117.codfw.wmnet with reason: Maintenance
[16:25:33] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2117 (T322618)', diff saved to https://phabricator.wikimedia.org/P42446 and previous config saved to /var/cache/conftool/dbconfig/20221207-162533-ladsgroup.json
[16:25:36] <stashbot>	 T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618
[16:25:47] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance
[16:25:54] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1096:3316 (T322618)', diff saved to https://phabricator.wikimedia.org/P42447 and previous config saved to /var/cache/conftool/dbconfig/20221207-162553-ladsgroup.json
[16:27:19] <logmsgbot>	 !log aqu@deploy1002 Started deploy [analytics/refinery@349e1cc]: Deploy HDFS usage dataset generation scripts [analytics/refinery@349e1cc]
[16:27:46] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117 (T322618)', diff saved to https://phabricator.wikimedia.org/P42448 and previous config saved to /var/cache/conftool/dbconfig/20221207-162745-ladsgroup.json
[16:28:02] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316 (T322618)', diff saved to https://phabricator.wikimedia.org/P42449 and previous config saved to /var/cache/conftool/dbconfig/20221207-162802-ladsgroup.json
[16:29:26] <logmsgbot>	 !log cwhite@cumin2002 START - Cookbook sre.hosts.reimage for host logstash1026.eqiad.wmnet with OS bullseye
[16:29:58] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2173.codfw.wmnet with reason: Maintenance
[16:30:11] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2173.codfw.wmnet with reason: Maintenance
[16:30:12] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2094.codfw.wmnet with reason: Maintenance
[16:30:25] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2094.codfw.wmnet with reason: Maintenance
[16:30:32] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2173 (T322618)', diff saved to https://phabricator.wikimedia.org/P42450 and previous config saved to /var/cache/conftool/dbconfig/20221207-163031-ladsgroup.json
[16:32:42] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T322618)', diff saved to https://phabricator.wikimedia.org/P42451 and previous config saved to /var/cache/conftool/dbconfig/20221207-163242-ladsgroup.json
[16:32:46] <stashbot>	 T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618
[16:33:07] <wikibugs>	 (03PS1) 10BBlack: cp: remove the last haproxy role refs from hiera [puppet] - 10https://gerrit.wikimedia.org/r/865691
[16:35:18] <wikibugs>	 (03PS1) 10Cwhite: logstash: move alertmanager severity field to labels.check_severity [puppet] - 10https://gerrit.wikimedia.org/r/865631 (https://phabricator.wikimedia.org/T324684)
[16:35:55] <wikibugs>	 (03CR) 10BBlack: "NOP in PCC just for extra verification: https://puppet-compiler.wmflabs.org/output/865691/38620/" [puppet] - 10https://gerrit.wikimedia.org/r/865691 (owner: 10BBlack)
[16:36:04] <wikibugs>	 (03CR) 10BBlack: [C: 03+2] cp: remove the last haproxy role refs from hiera [puppet] - 10https://gerrit.wikimedia.org/r/865691 (owner: 10BBlack)
[16:36:33] <jinxer-wm>	 (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on cloudcumin2001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[16:36:35] <wikibugs>	 (03PS1) 10Andrea Denisse: netmon: Remove netmon2001 from the alertmanager rw api [puppet] - 10https://gerrit.wikimedia.org/r/865693 (https://phabricator.wikimedia.org/T322695)
[16:36:45] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job es_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:38:07] <sukhe>	 !log cr[23]-eqsin*: set routing-options static route 103.102.166.240/28 next-hop 10.132.0.6: T322048
[16:38:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:38:10] <stashbot>	 T322048: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048
[16:38:54] <wikibugs>	 (03CR) 10Andrea Denisse: [V: 03+1] "PCC SUCCESS (NOOP 7): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38621/console" [puppet] - 10https://gerrit.wikimedia.org/r/865693 (https://phabricator.wikimedia.org/T322695) (owner: 10Andrea Denisse)
[16:40:17] <wikibugs>	 (03CR) 10Andrea Denisse: [V: 03+1] "PCC results:" [puppet] - 10https://gerrit.wikimedia.org/r/865693 (https://phabricator.wikimedia.org/T322695) (owner: 10Andrea Denisse)
[16:40:42] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1 C: 03+2] lvs5002: set profile::pybal::bgp to no [puppet] - 10https://gerrit.wikimedia.org/r/865687 (https://phabricator.wikimedia.org/T323830) (owner: 10Ssingh)
[16:40:54] <wikibugs>	 (03PS5) 10JMeybohm: k8s: Add support for PKI with k8s >= 1.23 [puppet] - 10https://gerrit.wikimedia.org/r/865592 (https://phabricator.wikimedia.org/T307943)
[16:41:30] <wikibugs>	 (03PS2) 10Eevans: echostore: bring codfw hosts up to date [deployment-charts] - 10https://gerrit.wikimedia.org/r/862307 (https://phabricator.wikimedia.org/T253244)
[16:42:27] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 4:00:00 on lvs5002.eqsin.wmnet with reason: downtimed, in the process of decom
[16:42:30] <sukhe>	 !log restart pybal on lvs5002
[16:42:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:42:35] <wikibugs>	 (03CR) 10JMeybohm: k8s: Add support for PKI with k8s >= 1.23 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/865592 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm)
[16:42:39] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1181.eqiad.wmnet with reason: Maintenance
[16:42:42] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on lvs5002.eqsin.wmnet with reason: downtimed, in the process of decom
[16:42:52] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117', diff saved to https://phabricator.wikimedia.org/P42452 and previous config saved to /var/cache/conftool/dbconfig/20221207-164252-ladsgroup.json
[16:42:53] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1181.eqiad.wmnet with reason: Maintenance
[16:42:59] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1181 (T322618)', diff saved to https://phabricator.wikimedia.org/P42453 and previous config saved to /var/cache/conftool/dbconfig/20221207-164258-ladsgroup.json
[16:43:02] <stashbot>	 T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618
[16:43:09] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316', diff saved to https://phabricator.wikimedia.org/P42454 and previous config saved to /var/cache/conftool/dbconfig/20221207-164308-ladsgroup.json
[16:43:13] * elukey bbiab
[16:43:22] <elukey>	 err wrong chan :)
[16:45:31] <wikibugs>	 (03PS1) 10Andrea Denisse: netmon: Remove the netmon2001 instance as passive node [puppet] - 10https://gerrit.wikimedia.org/r/865695 (https://phabricator.wikimedia.org/T322695)
[16:45:38] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jiji@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865123 (https://phabricator.wikimedia.org/T267581) (owner: 10Effie Mouzeli)
[16:46:31] <wikibugs>	 (03Merged) 10jenkins-bot: ProductionServices: Use redis_misc servers for LockManager (6/6) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865123 (https://phabricator.wikimedia.org/T267581) (owner: 10Effie Mouzeli)
[16:46:33] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+1] echostore: bring codfw hosts up to date [deployment-charts] - 10https://gerrit.wikimedia.org/r/862307 (https://phabricator.wikimedia.org/T253244) (owner: 10Eevans)
[16:46:55] <logmsgbot>	 !log jiji@deploy1002 Started scap: Backport for [[gerrit:865123|ProductionServices: Use redis_misc servers for LockManager (6/6) (T267581)]]
[16:46:59] <stashbot>	 T267581: Phase out "redis_sessions" cluster and away from memcached cluster - https://phabricator.wikimedia.org/T267581
[16:47:12] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38622/console" [puppet] - 10https://gerrit.wikimedia.org/r/865592 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm)
[16:47:19] <wikibugs>	 (03CR) 10Andrea Denisse: [V: 03+1] "PCC SUCCESS (NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38623/console" [puppet] - 10https://gerrit.wikimedia.org/r/865693 (https://phabricator.wikimedia.org/T322695) (owner: 10Andrea Denisse)
[16:47:50] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P42455 and previous config saved to /var/cache/conftool/dbconfig/20221207-164748-ladsgroup.json
[16:48:09] <wikibugs>	 (03CR) 10Eevans: [C: 03+2] echostore: bring codfw hosts up to date [deployment-charts] - 10https://gerrit.wikimedia.org/r/862307 (https://phabricator.wikimedia.org/T253244) (owner: 10Eevans)
[16:48:10] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T322618)', diff saved to https://phabricator.wikimedia.org/P42456 and previous config saved to /var/cache/conftool/dbconfig/20221207-164809-ladsgroup.json
[16:48:13] <stashbot>	 T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618
[16:48:49] <wikibugs>	 (03CR) 10Andrea Denisse: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38624/console" [puppet] - 10https://gerrit.wikimedia.org/r/865695 (https://phabricator.wikimedia.org/T322695) (owner: 10Andrea Denisse)
[16:48:52] <logmsgbot>	 !log jiji@deploy1002 jiji and jiji: Backport for [[gerrit:865123|ProductionServices: Use redis_misc servers for LockManager (6/6) (T267581)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet
[16:49:15] <wikibugs>	 (03PS1) 10Cmjohnson: updateing site.pp for kubernetes servers to change role to insetup::serviceops [puppet] - 10https://gerrit.wikimedia.org/r/865632 (https://phabricator.wikimedia.org/T313873)
[16:50:31] <wikibugs>	 (03CR) 10Andrea Denisse: [V: 03+1] "PCC results: https://puppet-compiler.wmflabs.org/output/865695/38624/" [puppet] - 10https://gerrit.wikimedia.org/r/865695 (https://phabricator.wikimedia.org/T322695) (owner: 10Andrea Denisse)
[16:51:03] <wikibugs>	 (03PS1) 10Ssingh: lvs5005: set as high-traffic2 primary LVS and remove lvs5002 (decomm) [puppet] - 10https://gerrit.wikimedia.org/r/865701 (https://phabricator.wikimedia.org/T323830)
[16:51:21] <wikibugs>	 (03CR) 10Cmjohnson: [C: 03+2] updateing site.pp for kubernetes servers to change role to insetup::serviceops [puppet] - 10https://gerrit.wikimedia.org/r/865632 (https://phabricator.wikimedia.org/T313873) (owner: 10Cmjohnson)
[16:51:45] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job es_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:53:17] <wikibugs>	 (03PS3) 10Hnowlan: thumbor: move replicas to main values, use swift discovery [deployment-charts] - 10https://gerrit.wikimedia.org/r/865595
[16:53:37] <wikibugs>	 (03Merged) 10jenkins-bot: echostore: bring codfw hosts up to date [deployment-charts] - 10https://gerrit.wikimedia.org/r/862307 (https://phabricator.wikimedia.org/T253244) (owner: 10Eevans)
[16:54:43] <wikibugs>	 (03PS1) 10Andrea Denisse: netmon: Add the netmon2002 as a LibreNMS scap deploy target [puppet] - 10https://gerrit.wikimedia.org/r/865705 (https://phabricator.wikimedia.org/T315523)
[16:55:10] <wikibugs>	 (03PS1) 10RobH: updating role [puppet] - 10https://gerrit.wikimedia.org/r/865706 (https://phabricator.wikimedia.org/T322048)
[16:55:14] <logmsgbot>	 !log eevans@deploy1002 helmfile [staging] START helmfile.d/services/echostore: apply
[16:55:16] <logmsgbot>	 !log eevans@deploy1002 helmfile [staging] DONE helmfile.d/services/echostore: apply
[16:55:22] <logmsgbot>	 !log eevans@deploy1002 helmfile [codfw] START helmfile.d/services/echostore: apply
[16:55:39] <wikibugs>	 (03PS2) 10RobH: updating role [puppet] - 10https://gerrit.wikimedia.org/r/865706 (https://phabricator.wikimedia.org/T322048)
[16:56:01] <logmsgbot>	 !log eevans@deploy1002 helmfile [codfw] DONE helmfile.d/services/echostore: apply
[16:56:02] <wikibugs>	 (03CR) 10Andrea Denisse: [V: 03+1] "PCC SUCCESS (NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38625/console" [puppet] - 10https://gerrit.wikimedia.org/r/865705 (https://phabricator.wikimedia.org/T315523) (owner: 10Andrea Denisse)
[16:56:13] <wikibugs>	 (03CR) 10RobH: [C: 03+2] updating role [puppet] - 10https://gerrit.wikimedia.org/r/865706 (https://phabricator.wikimedia.org/T322048) (owner: 10RobH)
[16:56:56] <wikibugs>	 (03CR) 10Andrea Denisse: [V: 03+1] "PCC results: https://puppet-compiler.wmflabs.org/output/865705/38625/" [puppet] - 10https://gerrit.wikimedia.org/r/865705 (https://phabricator.wikimedia.org/T315523) (owner: 10Andrea Denisse)
[16:57:54] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1024.eqiad.wmnet with OS bullseye
[16:57:59] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117', diff saved to https://phabricator.wikimedia.org/P42457 and previous config saved to /var/cache/conftool/dbconfig/20221207-165758-ladsgroup.json
[16:58:01] <wikibugs>	 (03CR) 10JHathaway: [C: 03+2] Add Wenjun Fan to analytics_privatedata_users [puppet] - 10https://gerrit.wikimedia.org/r/865177 (https://phabricator.wikimedia.org/T324057) (owner: 10JHathaway)
[16:58:04] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops, 10Patch-For-Review: Q1:rack/setup/install kubernetes102[34] - https://phabricator.wikimedia.org/T313873 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host kubernetes1024.eqiad.wmnet with OS bullseye
[16:58:15] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316', diff saved to https://phabricator.wikimedia.org/P42458 and previous config saved to /var/cache/conftool/dbconfig/20221207-165815-ladsgroup.json
[16:58:37] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1023.eqiad.wmnet with OS bullseye
[16:58:46] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops, 10Patch-For-Review: Q1:rack/setup/install kubernetes102[34] - https://phabricator.wikimedia.org/T313873 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host kubernetes1023.eqiad.wmnet with OS bullseye
[16:59:53] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to Turnilo for USER:wfan - https://phabricator.wikimedia.org/T324057 (10jhathaway) 05Open→03Resolved @AnnWF done!
[17:00:40] <wikibugs>	 (03PS1) 10Andrea Denisse: netmon: Add the netmon2002 instance as a ganeti rapi node. [puppet] - 10https://gerrit.wikimedia.org/r/865707 (https://phabricator.wikimedia.org/T315523)
[17:01:42] <logmsgbot>	 !log jiji@deploy1002 Finished scap: Backport for [[gerrit:865123|ProductionServices: Use redis_misc servers for LockManager (6/6) (T267581)]] (duration: 14m 46s)
[17:01:45] <stashbot>	 T267581: Phase out "redis_sessions" cluster and away from memcached cluster - https://phabricator.wikimedia.org/T267581
[17:01:52] <wikibugs>	 (03CR) 10Andrea Denisse: [V: 03+1] "PCC SUCCESS (NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38626/console" [puppet] - 10https://gerrit.wikimedia.org/r/865707 (https://phabricator.wikimedia.org/T315523) (owner: 10Andrea Denisse)
[17:01:57] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install kafka-jumbo101[0-5] - https://phabricator.wikimedia.org/T306939 (10Papaul) @BTullis feel free
[17:02:57] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P42459 and previous config saved to /var/cache/conftool/dbconfig/20221207-170256-ladsgroup.json
[17:03:03] <icinga-wm>	 PROBLEM - Check systemd state on an-tool1005 is CRITICAL: CRITICAL - degraded: The following units failed: envoyproxy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:03:16] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P42460 and previous config saved to /var/cache/conftool/dbconfig/20221207-170316-ladsgroup.json
[17:04:48] <wikibugs>	 (03PS1) 10Andrea Denisse: netmon: Remove rsync quickdatacopy failover restrictions [puppet] - 10https://gerrit.wikimedia.org/r/865708 (https://phabricator.wikimedia.org/T309074)
[17:06:29] <wikibugs>	 (03CR) 10Andrea Denisse: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38627/console" [puppet] - 10https://gerrit.wikimedia.org/r/865708 (https://phabricator.wikimedia.org/T309074) (owner: 10Andrea Denisse)
[17:07:11] <wikibugs>	 (03CR) 10Andrea Denisse: [V: 03+1] "PCC results: https://puppet-compiler.wmflabs.org/output/865707/38626/" [puppet] - 10https://gerrit.wikimedia.org/r/865707 (https://phabricator.wikimedia.org/T315523) (owner: 10Andrea Denisse)
[17:07:33] <wikibugs>	 (03CR) 10Andrea Denisse: [V: 03+1] "PCC results: https://puppet-compiler.wmflabs.org/output/865707/38626/" [puppet] - 10https://gerrit.wikimedia.org/r/865707 (https://phabricator.wikimedia.org/T315523) (owner: 10Andrea Denisse)
[17:08:13] <wikibugs>	 (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2103 to s1 master [puppet] - 10https://gerrit.wikimedia.org/r/865633 (https://phabricator.wikimedia.org/T324692)
[17:08:15] <wikibugs>	 (03CR) 10Andrea Denisse: [V: 03+1] "This constraint is no longer required." [puppet] - 10https://gerrit.wikimedia.org/r/865708 (https://phabricator.wikimedia.org/T309074) (owner: 10Andrea Denisse)
[17:08:23] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.decommission for hosts lvs5002.eqsin.wmnet
[17:08:52] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to ciadmin for Dom Walden - https://phabricator.wikimedia.org/T323549 (10jhathaway) 05Open→03Resolved a:03jhathaway @dom_walden done!
[17:10:03] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1024.eqiad.wmnet with reason: host reimage
[17:10:41] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1023.eqiad.wmnet with reason: host reimage
[17:11:25] <wikibugs>	 (03PS3) 10JMeybohm: pki: Add intermediates for wikikube and wikikube staging [puppet] - 10https://gerrit.wikimedia.org/r/865591
[17:11:27] <wikibugs>	 (03PS6) 10JMeybohm: k8s: Add support for PKI with k8s >= 1.23 [puppet] - 10https://gerrit.wikimedia.org/r/865592 (https://phabricator.wikimedia.org/T307943)
[17:11:51] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 38 hosts with reason: Primary switchover s1 T324692
[17:11:55] <stashbot>	 T324692: Switchover s1 master (db2112 -> db2103) - https://phabricator.wikimedia.org/T324692
[17:12:17] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 38 hosts with reason: Primary switchover s1 T324692
[17:12:46] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.dns.netbox
[17:13:02] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1024.eqiad.wmnet with reason: host reimage
[17:13:05] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117 (T322618)', diff saved to https://phabricator.wikimedia.org/P42461 and previous config saved to /var/cache/conftool/dbconfig/20221207-171305-ladsgroup.json
[17:13:07] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2124.codfw.wmnet with reason: Maintenance
[17:13:08] <stashbot>	 T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618
[17:13:20] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2124.codfw.wmnet with reason: Maintenance
[17:13:22] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316 (T322618)', diff saved to https://phabricator.wikimedia.org/P42462 and previous config saved to /var/cache/conftool/dbconfig/20221207-171321-ladsgroup.json
[17:13:23] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance
[17:13:27] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2124 (T322618)', diff saved to https://phabricator.wikimedia.org/P42463 and previous config saved to /var/cache/conftool/dbconfig/20221207-171326-ladsgroup.json
[17:13:36] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance
[17:13:43] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3316 (T322618)', diff saved to https://phabricator.wikimedia.org/P42464 and previous config saved to /var/cache/conftool/dbconfig/20221207-171342-ladsgroup.json
[17:14:17] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Set db2103 with weight 0 T324692', diff saved to https://phabricator.wikimedia.org/P42465 and previous config saved to /var/cache/conftool/dbconfig/20221207-171416-ladsgroup.json
[17:14:27] <wikibugs>	 (03PS1) 10Andrea Denisse: netmon: Set netmon2002 the main instance in codfw [puppet] - 10https://gerrit.wikimedia.org/r/865711 (https://phabricator.wikimedia.org/T315523)
[17:14:41] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: lvs5002.eqsin.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin2002"
[17:15:32] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1023.eqiad.wmnet with reason: host reimage
[17:15:39] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124 (T322618)', diff saved to https://phabricator.wikimedia.org/P42466 and previous config saved to /var/cache/conftool/dbconfig/20221207-171538-ladsgroup.json
[17:15:44] <wikibugs>	 (03CR) 10Andrea Denisse: [V: 03+1] "PCC SUCCESS (NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38629/console" [puppet] - 10https://gerrit.wikimedia.org/r/865711 (https://phabricator.wikimedia.org/T315523) (owner: 10Andrea Denisse)
[17:15:51] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 (T322618)', diff saved to https://phabricator.wikimedia.org/P42467 and previous config saved to /var/cache/conftool/dbconfig/20221207-171551-ladsgroup.json
[17:16:13] <wikibugs>	 (03PS1) 10Ssingh: sites.yaml: remove decommissioned host lvs5002 [homer/public] - 10https://gerrit.wikimedia.org/r/865712 (https://phabricator.wikimedia.org/T323830)
[17:16:57] <wikibugs>	 (03CR) 10Andrea Denisse: [V: 03+1] "PCC results: https://puppet-compiler.wmflabs.org/output/865711/38629/" [puppet] - 10https://gerrit.wikimedia.org/r/865711 (https://phabricator.wikimedia.org/T315523) (owner: 10Andrea Denisse)
[17:17:17] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: lvs5002.eqsin.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin2002"
[17:17:18] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:17:18] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts lvs5002.eqsin.wmnet
[17:17:26] <wikibugs>	 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q2:rack/setup/install/decom eqsin: unified decommission task - https://phabricator.wikimedia.org/T323830 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by sukhe@cumin2002 for hosts: `lvs5002.eqsin.wmnet` - lvs5002.eqsin.wmnet...
[17:18:03] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T322618)', diff saved to https://phabricator.wikimedia.org/P42468 and previous config saved to /var/cache/conftool/dbconfig/20221207-171803-ladsgroup.json
[17:18:23] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P42469 and previous config saved to /var/cache/conftool/dbconfig/20221207-171822-ladsgroup.json
[17:24:34] <wikibugs>	 (03PS1) 10Papaul: Fix typo for sretest2002 node in site.pp [puppet] - 10https://gerrit.wikimedia.org/r/865715 (https://phabricator.wikimedia.org/T322578)
[17:24:56] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] sites.yaml: remove decommissioned host lvs5002 [homer/public] - 10https://gerrit.wikimedia.org/r/865712 (https://phabricator.wikimedia.org/T323830) (owner: 10Ssingh)
[17:25:54] <sukhe>	 !log running homer for Gerrit: 865712
[17:25:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:26:09] <logmsgbot>	 !log cwhite@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host logstash1026.eqiad.wmnet with OS bullseye
[17:26:34] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - cmjohnson@cumin1001"
[17:27:39] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - cmjohnson@cumin1001"
[17:29:18] <wikibugs>	 (03PS2) 10Hashar: contint: add ci::master to contint1002 [puppet] - 10https://gerrit.wikimedia.org/r/865680 (https://phabricator.wikimedia.org/T313832)
[17:29:37] <wikibugs>	 (03PS3) 10BBlack: Add 'cdn' conftool service to all caches [puppet] - 10https://gerrit.wikimedia.org/r/863336 (https://phabricator.wikimedia.org/T324336)
[17:29:39] <wikibugs>	 (03PS3) 10BBlack: Switch pybal + scripts to 'cdn' service [puppet] - 10https://gerrit.wikimedia.org/r/863337 (https://phabricator.wikimedia.org/T324336)
[17:29:41] <wikibugs>	 (03PS3) 10BBlack: Remove legacy varnish-fe + ats-tls conftool keys [puppet] - 10https://gerrit.wikimedia.org/r/863338 (https://phabricator.wikimedia.org/T324336)
[17:29:43] <wikibugs>	 (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/865680 (https://phabricator.wikimedia.org/T313832) (owner: 10Hashar)
[17:30:25] <wikibugs>	 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q2:rack/setup/install/decom eqsin: unified decommission task - https://phabricator.wikimedia.org/T323830 (10ssingh)
[17:30:45] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124', diff saved to https://phabricator.wikimedia.org/P42470 and previous config saved to /var/cache/conftool/dbconfig/20221207-173045-ladsgroup.json
[17:30:58] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P42471 and previous config saved to /var/cache/conftool/dbconfig/20221207-173057-ladsgroup.json
[17:31:48] <wikibugs>	 (03CR) 10BBlack: [C: 03+1] lvs5005: set as high-traffic2 primary LVS and remove lvs5002 (decomm) [puppet] - 10https://gerrit.wikimedia.org/r/865701 (https://phabricator.wikimedia.org/T323830) (owner: 10Ssingh)
[17:32:08] <wikibugs>	 (03CR) 10Papaul: [C: 03+2] Fix typo for sretest2002 node in site.pp [puppet] - 10https://gerrit.wikimedia.org/r/865715 (https://phabricator.wikimedia.org/T322578) (owner: 10Papaul)
[17:32:10] <wikibugs>	 (03CR) 10Ottomata: flink-kubernetes-operator - modify for WMF (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/865158 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata)
[17:33:29] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T322618)', diff saved to https://phabricator.wikimedia.org/P42472 and previous config saved to /var/cache/conftool/dbconfig/20221207-173329-ladsgroup.json
[17:33:31] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1194.eqiad.wmnet with reason: Maintenance
[17:33:33] <stashbot>	 T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618
[17:33:44] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1194.eqiad.wmnet with reason: Maintenance
[17:33:51] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1194 (T322618)', diff saved to https://phabricator.wikimedia.org/P42473 and previous config saved to /var/cache/conftool/dbconfig/20221207-173350-ladsgroup.json
[17:35:20] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] lvs5005: set as high-traffic2 primary LVS and remove lvs5002 (decomm) [puppet] - 10https://gerrit.wikimedia.org/r/865701 (https://phabricator.wikimedia.org/T323830) (owner: 10Ssingh)
[17:35:37] <wikibugs>	 (03PS2) 10Ssingh: lvs5005: set as high-traffic2 primary LVS and remove lvs5002 (decomm) [puppet] - 10https://gerrit.wikimedia.org/r/865701 (https://phabricator.wikimedia.org/T323830)
[17:36:13] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2002.codfw.wmnet with OS bullseye
[17:36:21] <wikibugs>	 10SRE, 10ops-codfw, 10Patch-For-Review: codfw:test new Supermicro server - https://phabricator.wikimedia.org/T322578 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host sretest2002.codfw.wmnet with OS bullseye
[17:36:30] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2002.codfw.wmnet with OS bullseye
[17:36:38] <wikibugs>	 10SRE, 10ops-codfw, 10Patch-For-Review: codfw:test new Supermicro server - https://phabricator.wikimedia.org/T322578 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host sretest2002.codfw.wmnet with OS bullseye executed with errors: - sretest2002 (**FAIL**)   - **T...
[17:36:46] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2002.codfw.wmnet with OS bullseye
[17:36:54] <wikibugs>	 10SRE, 10ops-codfw, 10Patch-For-Review: codfw:test new Supermicro server - https://phabricator.wikimedia.org/T322578 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host sretest2002.codfw.wmnet with OS bullseye
[17:37:36] <wikibugs>	 (03PS1) 10JHathaway: Add Kelton Hurd to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/865716 (https://phabricator.wikimedia.org/T323941)
[17:38:16] <wikibugs>	 (03CR) 10SBassett: [C: 03+1] Add Kelton Hurd to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/865716 (https://phabricator.wikimedia.org/T323941) (owner: 10JHathaway)
[17:38:24] <wikibugs>	 (03CR) 10Ottomata: flink-kubernetes-operator - modify for WMF (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/865158 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata)
[17:40:45] <wikibugs>	 (03CR) 10Andrea Denisse: [V: 03+1 C: 03+2] netmon: Add the netmon2002 as a LibreNMS scap deploy target [puppet] - 10https://gerrit.wikimedia.org/r/865705 (https://phabricator.wikimedia.org/T315523) (owner: 10Andrea Denisse)
[17:41:01] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10Security-Team, 10Patch-For-Review, 10SecTeam-Processed: Add Kelton Hurd to wmf ldap group - https://phabricator.wikimedia.org/T323941 (10jhathaway) 05Open→03Resolved a:03jhathaway @KHurd-WMF done!
[17:41:38] <wikibugs>	 (03PS2) 10Ladsgroup: mariadb: Promote db2103 to s1 master [puppet] - 10https://gerrit.wikimedia.org/r/865633 (https://phabricator.wikimedia.org/T324692) (owner: 10Gerrit maintenance bot)
[17:41:42] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest2002.codfw.wmnet with reason: host reimage
[17:41:42] <wikibugs>	 (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] mariadb: Promote db2103 to s1 master [puppet] - 10https://gerrit.wikimedia.org/r/865633 (https://phabricator.wikimedia.org/T324692) (owner: 10Gerrit maintenance bot)
[17:42:32] <wikibugs>	 (03CR) 10JHathaway: [C: 03+2] Add Kelton Hurd to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/865716 (https://phabricator.wikimedia.org/T323941) (owner: 10JHathaway)
[17:42:55] <sukhe>	 !log restart pybal on lvs5005 to pick up bgp-med
[17:42:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:43:20] <wikibugs>	 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (10ssingh)
[17:45:02] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest2002.codfw.wmnet with reason: host reimage
[17:45:11] <Amir1>	 !log Starting s1 codfw failover from db2112 to db2103 - T324692
[17:45:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:45:13] <stashbot>	 T324692: Switchover s1 master (db2112 -> db2103) - https://phabricator.wikimedia.org/T324692
[17:45:41] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Promote db2103 to s1 primary T324692', diff saved to https://phabricator.wikimedia.org/P42474 and previous config saved to /var/cache/conftool/dbconfig/20221207-174540-ladsgroup.json
[17:45:52] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124', diff saved to https://phabricator.wikimedia.org/P42475 and previous config saved to /var/cache/conftool/dbconfig/20221207-174551-ladsgroup.json
[17:46:05] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P42476 and previous config saved to /var/cache/conftool/dbconfig/20221207-174604-ladsgroup.json
[17:46:32] <logmsgbot>	 !log aqu@deploy1002 Finished deploy [analytics/refinery@349e1cc]: Deploy HDFS usage dataset generation scripts [analytics/refinery@349e1cc] (duration: 79m 12s)
[17:46:41] <logmsgbot>	 !log cwhite@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['logstash1026']
[17:48:11] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depool db2112 T324692', diff saved to https://phabricator.wikimedia.org/P42477 and previous config saved to /var/cache/conftool/dbconfig/20221207-174811-ladsgroup.json
[17:48:21] <wikibugs>	 (03PS1) 10JHathaway: Add Vaughn Walters to the wmf group [puppet] - 10https://gerrit.wikimedia.org/r/865718 (https://phabricator.wikimedia.org/T324515)
[17:48:46] <logmsgbot>	 !log aqu@deploy1002 Started deploy [analytics/refinery@349e1cc] (thin): Deploy HDFS usage dataset generation scripts THIN [analytics/refinery@349e1cc]
[17:48:53] <logmsgbot>	 !log aqu@deploy1002 Finished deploy [analytics/refinery@349e1cc] (thin): Deploy HDFS usage dataset generation scripts THIN [analytics/refinery@349e1cc] (duration: 00m 07s)
[17:49:07] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2112.codfw.wmnet with reason: Maintenance
[17:49:10] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2112.codfw.wmnet with reason: Maintenance
[17:49:25] <logmsgbot>	 !log aqu@deploy1002 Started deploy [analytics/refinery@349e1cc] (hadoop-test): Deploy HDFS usage dataset generation scripts TEST [analytics/refinery@349e1cc]
[17:49:49] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review, 10User-vaughnwalters, 10User-zeljkofilipin: Request for wmf group access for user: vwalters - https://phabricator.wikimedia.org/T324515 (10jhathaway) 05Open→03Resolved a:03jhathaway @vaughnwalters done!
[17:50:41] <logmsgbot>	 !log aqu@deploy1002 Finished deploy [analytics/refinery@349e1cc] (hadoop-test): Deploy HDFS usage dataset generation scripts TEST [analytics/refinery@349e1cc] (duration: 01m 15s)
[17:51:10] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10Security-Team, 10SecTeam-Processed: Add Kelton Hurd to wmf ldap group - https://phabricator.wikimedia.org/T323941 (10sbassett)
[17:52:38] <steve_munene>	 Hello, about to update the varnishkafka certificates, which will entail:
[17:52:38] <steve_munene>	 Disabling puppet on all cp servers
[17:52:38] <steve_munene>	 Merging the changes made
[17:52:38] <steve_munene>	 verifying the keypair is updated
[17:52:38] <steve_munene>	 verifying that the varnishkafka instance restarted; if not, performing a restart
[17:52:39] <steve_munene>	 re-enabling and running puppet on all varnishkafka instances
[17:52:39] <steve_munene>	 T323771
[17:52:39] <stashbot>	 T323771: Update varnishkafka client certificate for authenticating to kafka-jumbo - https://phabricator.wikimedia.org/T323771
[17:53:17] <logmsgbot>	 !log cwhite@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['logstash1026']
[17:54:15] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2112.codfw.wmnet with reason: Maintenance
[17:54:18] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2112.codfw.wmnet with reason: Maintenance
[17:54:38] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to wmde for Muhammad Jaziraly - https://phabricator.wikimedia.org/T324477 (10jhathaway) @KFrancis has @Muhammad_Yasser_Jazirahly_WMDE signed an NDA?
[17:56:15] <logmsgbot>	 !log cwhite@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['logstash1026']
[17:56:38] <wikibugs>	 (03PS1) 10Ssingh: hiera: lvs5003: bump bgp_med to 150 [puppet] - 10https://gerrit.wikimedia.org/r/865720 (https://phabricator.wikimedia.org/T323830)
[17:57:41] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38632/console" [puppet] - 10https://gerrit.wikimedia.org/r/865720 (https://phabricator.wikimedia.org/T323830) (owner: 10Ssingh)
[17:58:37] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[17:58:42] <wikibugs>	 (CR) Ssingh: [V: +1 C: +2] hiera: lvs5003: bump bgp_med to 150 [puppet] - https://gerrit.wikimedia.org/r/865720 (https://phabricator.wikimedia.org/T323830) (owner: Ssingh)
[17:59:59] <wikibugs>	 (CR) JHathaway: [C: +2] Add Vaughn Walters to the wmf group [puppet] - https://gerrit.wikimedia.org/r/865718 (https://phabricator.wikimedia.org/T324515) (owner: JHathaway)
[18:00:13] <wikibugs>	 (PS1) Ssingh: lvs5006: commission new LVS host (eqsin hardware refresh) [puppet] - https://gerrit.wikimedia.org/r/865722 (https://phabricator.wikimedia.org/T322048)
[18:00:56] <sukhe>	 !log restart pybal on lvs5003 to pick up bgp-med change
[18:00:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:00:58] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124 (T322618)', diff saved to https://phabricator.wikimedia.org/P42478 and previous config saved to /var/cache/conftool/dbconfig/20221207-180058-ladsgroup.json
[18:01:00] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2129.codfw.wmnet with reason: Maintenance
[18:01:01] <stashbot>	 T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618
[18:01:11] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 (T322618)', diff saved to https://phabricator.wikimedia.org/P42479 and previous config saved to /var/cache/conftool/dbconfig/20221207-180110-ladsgroup.json
[18:01:12] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance
[18:01:13] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2129.codfw.wmnet with reason: Maintenance
[18:01:20] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2129 (T322618)', diff saved to https://phabricator.wikimedia.org/P42480 and previous config saved to /var/cache/conftool/dbconfig/20221207-180119-ladsgroup.json
[18:01:26] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance
[18:01:32] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3316 (T322618)', diff saved to https://phabricator.wikimedia.org/P42481 and previous config saved to /var/cache/conftool/dbconfig/20221207-180132-ladsgroup.json
[18:01:40] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 (T322618)', diff saved to https://phabricator.wikimedia.org/P42482 and previous config saved to /var/cache/conftool/dbconfig/20221207-180140-ladsgroup.json
[18:03:19] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[18:03:19] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest2002.codfw.wmnet with OS bullseye
[18:03:27] <wikibugs>	 SRE, ops-codfw, Patch-For-Review: codfw:test new Supermicro server - https://phabricator.wikimedia.org/T322578 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host sretest2002.codfw.wmnet with OS bullseye completed: - sretest2002 (**PASS**)   - Downtimed on I...
[18:03:32] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2129 (T322618)', diff saved to https://phabricator.wikimedia.org/P42483 and previous config saved to /var/cache/conftool/dbconfig/20221207-180331-ladsgroup.json
[18:04:53] <logmsgbot>	 !log cwhite@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['logstash1026']
[18:05:48] <wikibugs>	 SRE, ops-codfw, Patch-For-Review: codfw:test new Supermicro server - https://phabricator.wikimedia.org/T322578 (Papaul)
[18:05:58] <wikibugs>	 (PS7) Hnowlan: maps: remove tilerator and cassandra [puppet] - https://gerrit.wikimedia.org/r/760619 (https://phabricator.wikimedia.org/T298246)
[18:06:55] <logmsgbot>	 !log cwhite@cumin2002 START - Cookbook sre.hosts.reimage for host logstash1026.eqiad.wmnet with OS bullseye
[18:09:32] <wikibugs>	 SRE, ops-codfw, Patch-For-Review: codfw:test new Supermicro server - https://phabricator.wikimedia.org/T322578 (Papaul)
[18:09:54] <wikibugs>	 SRE, LDAP-Access-Requests: Logstash access for contractor Wangombe - https://phabricator.wikimedia.org/T318209 (jhathaway) @Nikerabbit when does their contract expire, so I can document it in our user database?
[18:10:46] <wikibugs>	 (CR) Ottomata: flink and flink-kubernetes-operator image (2 comments) [docker-images/production-images] - https://gerrit.wikimedia.org/r/858356 (https://phabricator.wikimedia.org/T316519) (owner: Ottomata)
[18:12:19] <wikibugs>	 (PS8) Hnowlan: maps: remove tilerator and cassandra [puppet] - https://gerrit.wikimedia.org/r/760619 (https://phabricator.wikimedia.org/T298246)
[18:13:49] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to ciadmin for Dom Walden - https://phabricator.wikimedia.org/T323549 (dom_walden) >>! In T323549#8451455, @jhathaway wrote: > @dom_walden done!  Thanks!
[18:14:31] <wikibugs>	 (CR) Hnowlan: maps: remove tilerator and cassandra (1 comment) [puppet] - https://gerrit.wikimedia.org/r/760619 (https://phabricator.wikimedia.org/T298246) (owner: Hnowlan)
[18:16:48] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P42484 and previous config saved to /var/cache/conftool/dbconfig/20221207-181647-ladsgroup.json
[18:18:38] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2129', diff saved to https://phabricator.wikimedia.org/P42485 and previous config saved to /var/cache/conftool/dbconfig/20221207-181838-ladsgroup.json
[18:19:28] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:19:36] <wikibugs>	 (CR) RLazarus: [C: +1] Add Vaughn Walters to the wmf group [puppet] - https://gerrit.wikimedia.org/r/865718 (https://phabricator.wikimedia.org/T324515) (owner: JHathaway)
[18:23:59] <logmsgbot>	 !log cwhite@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on logstash1026.eqiad.wmnet with reason: host reimage
[18:26:18] <wikibugs>	 (CR) BBlack: [C: +1] "LGTM, great work!" [puppet] - https://gerrit.wikimedia.org/r/857748 (https://phabricator.wikimedia.org/T315676) (owner: Vgutierrez)
[18:27:02] <logmsgbot>	 !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on logstash1026.eqiad.wmnet with reason: host reimage
[18:27:22] <wikibugs>	 (CR) BBlack: [C: +1] "In general this seems like it's on the right track.  Given the complexity, I wouldn't be shocked if we find we need minor post-merge fixup" [puppet] - https://gerrit.wikimedia.org/r/824769 (https://phabricator.wikimedia.org/T315676) (owner: Isaac Johnson)
[18:28:09] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T322618)', diff saved to https://phabricator.wikimedia.org/P42486 and previous config saved to /var/cache/conftool/dbconfig/20221207-182808-ladsgroup.json
[18:28:12] <stashbot>	 T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618
[18:31:54] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P42487 and previous config saved to /var/cache/conftool/dbconfig/20221207-183154-ladsgroup.json
[18:32:33] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[18:32:36] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:32:52] <wikibugs>	 SRE, ops-eqiad, DC-Ops, serviceops: hw troubleshooting: PSU failure for restbase1018.eqiad.wmnet - https://phabricator.wikimedia.org/T324572 (wiki_willy) a:Cmjohnson→Jclark-ctr
[18:33:30] <wikibugs>	 (CR) Krinkle: [C: -1] "I suspect this is no longer needed with the x2 replicas removed from db config. Please confirm and close or clarify accordingly :)" [mediawiki-config] - https://gerrit.wikimedia.org/r/828072 (https://phabricator.wikimedia.org/T312809) (owner: Aaron Schulz)
[18:33:34] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering-Planning: hw troubleshooting: power supply for an-worker1184.eqiad.wmnet - https://phabricator.wikimedia.org/T324559 (wiki_willy) a:Cmjohnson→Jclark-ctr
[18:33:45] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2129', diff saved to https://phabricator.wikimedia.org/P42488 and previous config saved to /var/cache/conftool/dbconfig/20221207-183344-ladsgroup.json
[18:41:42] <wikibugs>	 SRE, ops-eqiad, Data-Persistence (work done), Phabricator, and 3 others: decommission phab1001.eqiad.wmnet - https://phabricator.wikimedia.org/T323418 (Dzahn) Thank you @Marostegui , perfect :)
[18:42:33] <wikibugs>	 (CR) Dzahn: [C: +1] mariadb: remove phab1001 from production-m3 grants (1 comment) [puppet] - https://gerrit.wikimedia.org/r/858419 (https://phabricator.wikimedia.org/T323418) (owner: Dzahn)
[18:42:46] <wikibugs>	 (Abandoned) Dzahn: mariadb: remove phab1001 from production-m3 grants [puppet] - https://gerrit.wikimedia.org/r/858419 (https://phabricator.wikimedia.org/T323418) (owner: Dzahn)
[18:43:16] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P42489 and previous config saved to /var/cache/conftool/dbconfig/20221207-184315-ladsgroup.json
[18:45:42] <logmsgbot>	 !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - cmjohnson@cumin1001"
[18:45:42] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1024.eqiad.wmnet with OS bullseye
[18:45:47] <wikibugs>	 SRE, ops-eqiad, DC-Ops, serviceops: Q1:rack/setup/install kubernetes102[34] - https://phabricator.wikimedia.org/T313873 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host kubernetes1024.eqiad.wmnet with OS bullseye completed: - kubernetes1024 (**WARN...
[18:45:48] <logmsgbot>	 !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - cmjohnson@cumin1001"
[18:45:48] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1023.eqiad.wmnet with OS bullseye
[18:45:54] <wikibugs>	 SRE, ops-eqiad, DC-Ops, serviceops: Q1:rack/setup/install kubernetes102[34] - https://phabricator.wikimedia.org/T313873 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host kubernetes1023.eqiad.wmnet with OS bullseye completed: - kubernetes1023 (**WARN...
[18:47:01] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 (T322618)', diff saved to https://phabricator.wikimedia.org/P42490 and previous config saved to /var/cache/conftool/dbconfig/20221207-184700-ladsgroup.json
[18:47:02] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1131.eqiad.wmnet with reason: Maintenance
[18:47:05] <stashbot>	 T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618
[18:47:16] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1131.eqiad.wmnet with reason: Maintenance
[18:47:22] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1131 (T322618)', diff saved to https://phabricator.wikimedia.org/P42491 and previous config saved to /var/cache/conftool/dbconfig/20221207-184722-ladsgroup.json
[18:47:53] <wikibugs>	 (PS1) RobH: r650xs updates [software] - https://gerrit.wikimedia.org/r/865724
[18:48:18] <wikibugs>	 (CR) RobH: [C: +2] r650xs updates [software] - https://gerrit.wikimedia.org/r/865724 (owner: RobH)
[18:48:31] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131 (T322618)', diff saved to https://phabricator.wikimedia.org/P42492 and previous config saved to /var/cache/conftool/dbconfig/20221207-184830-ladsgroup.json
[18:48:48] <wikibugs>	 (Merged) jenkins-bot: r650xs updates [software] - https://gerrit.wikimedia.org/r/865724 (owner: RobH)
[18:48:51] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2129 (T322618)', diff saved to https://phabricator.wikimedia.org/P42493 and previous config saved to /var/cache/conftool/dbconfig/20221207-184851-ladsgroup.json
[18:48:53] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2141.codfw.wmnet with reason: Maintenance
[18:49:11] <logmsgbot>	 !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host logstash1026.eqiad.wmnet with OS bullseye
[18:49:17] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2141.codfw.wmnet with reason: Maintenance
[18:49:24] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2158.codfw.wmnet with reason: Maintenance
[18:49:37] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2158.codfw.wmnet with reason: Maintenance
[18:49:39] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2095.codfw.wmnet with reason: Maintenance
[18:49:52] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2095.codfw.wmnet with reason: Maintenance
[18:49:59] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2158 (T322618)', diff saved to https://phabricator.wikimedia.org/P42494 and previous config saved to /var/cache/conftool/dbconfig/20221207-184958-ladsgroup.json
[18:52:11] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T322618)', diff saved to https://phabricator.wikimedia.org/P42495 and previous config saved to /var/cache/conftool/dbconfig/20221207-185210-ladsgroup.json
[18:52:14] <stashbot>	 T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618
[18:56:49] <wikibugs>	 (CR) Dzahn: "moving ahead with this. contint1001 has been breaking. and this is existing group on new host which will turn into the same role. it needs" [puppet] - https://gerrit.wikimedia.org/r/865672 (https://phabricator.wikimedia.org/T313832) (owner: Hashar)
[18:56:53] <wikibugs>	 (CR) Dzahn: [C: +2] contint: give RelEng access to contint1002 [puppet] - https://gerrit.wikimedia.org/r/865672 (https://phabricator.wikimedia.org/T313832) (owner: Hashar)
[18:56:59] <wikibugs>	 (PS3) Dzahn: contint: give RelEng access to contint1002 [puppet] - https://gerrit.wikimedia.org/r/865672 (https://phabricator.wikimedia.org/T313832) (owner: Hashar)
[18:58:01] <wikibugs>	 (PS4) Dzahn: contint: give RelEng access to contint1002 [puppet] - https://gerrit.wikimedia.org/r/865672 (https://phabricator.wikimedia.org/T313832) (owner: Hashar)
[18:58:22] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P42496 and previous config saved to /var/cache/conftool/dbconfig/20221207-185821-ladsgroup.json
[19:00:04] <jouncebot>	 ^demon and dancy: #bothumor I � Unicode. All rise for Train log triage with CPT deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221207T1900).
[19:00:04] <jouncebot>	 ^demon and dancy: OwO what's this, a deployment window?? MediaWiki train - Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221207T1900). nyaa~
[19:00:25] <TheresNoTime>	 hmm
[19:00:46] <wikibugs>	 (CR) Dzahn: "You have now shell access." [puppet] - https://gerrit.wikimedia.org/r/865672 (https://phabricator.wikimedia.org/T313832) (owner: Hashar)
[19:02:23] <wikibugs>	 (PS2) Dzahn: contint: add contint1002 as a scap target [puppet] - https://gerrit.wikimedia.org/r/865681 (https://phabricator.wikimedia.org/T313832) (owner: Hashar)
[19:03:37] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P42497 and previous config saved to /var/cache/conftool/dbconfig/20221207-190337-ladsgroup.json
[19:05:41] <wikibugs>	 (PS4) Slyngshede: sre.ganeti.reimage: add new cookbook [cookbooks] - https://gerrit.wikimedia.org/r/865057 (https://phabricator.wikimedia.org/T306661)
[19:06:13] <wikibugs>	 (CR) Slyngshede: sre.ganeti.reimage: add new cookbook (11 comments) [cookbooks] - https://gerrit.wikimedia.org/r/865057 (https://phabricator.wikimedia.org/T306661) (owner: Slyngshede)
[19:07:08] <wikibugs>	 (CR) Slyngshede: "Thanks, the comments helped a lot in clarifying the work needed to be done." [cookbooks] - https://gerrit.wikimedia.org/r/865057 (https://phabricator.wikimedia.org/T306661) (owner: Slyngshede)
[19:07:16] <wikibugs>	 (PS5) Slyngshede: sre.ganeti.reimage: add new cookbook [cookbooks] - https://gerrit.wikimedia.org/r/865057 (https://phabricator.wikimedia.org/T306661)
[19:07:17] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P42498 and previous config saved to /var/cache/conftool/dbconfig/20221207-190717-ladsgroup.json
[19:07:18] <wikibugs>	 (CR) CI reject: [V: -1] sre.ganeti.reimage: add new cookbook [cookbooks] - https://gerrit.wikimedia.org/r/865057 (https://phabricator.wikimedia.org/T306661) (owner: Slyngshede)
[19:08:06] <wikibugs>	 SRE, Continuous-Integration-Infrastructure, Release-Engineering-Team, serviceops-collab, Patch-For-Review: contint1002 service implementation tracking - https://phabricator.wikimedia.org/T313832 (Dzahn) I merged your change https://gerrit.wikimedia.org/r/c/operations/puppet/+/865672/4  so now...
[19:08:49] <wikibugs>	 (CR) CI reject: [V: -1] sre.ganeti.reimage: add new cookbook [cookbooks] - https://gerrit.wikimedia.org/r/865057 (https://phabricator.wikimedia.org/T306661) (owner: Slyngshede)
[19:09:40] <wikibugs>	 (CR) Herron: [C: +1] netmon: Remove rsync quickdatacopy failover restrictions [puppet] - https://gerrit.wikimedia.org/r/865708 (https://phabricator.wikimedia.org/T309074) (owner: Andrea Denisse)
[19:10:25] <wikibugs>	 (CR) Andrea Denisse: [V: +1 C: +2] netmon: Remove rsync quickdatacopy failover restrictions [puppet] - https://gerrit.wikimedia.org/r/865708 (https://phabricator.wikimedia.org/T309074) (owner: Andrea Denisse)
[19:11:05] <wikibugs>	 (PS6) Slyngshede: sre.ganeti.reimage: add new cookbook [cookbooks] - https://gerrit.wikimedia.org/r/865057 (https://phabricator.wikimedia.org/T306661)
[19:12:55] <jinxer-wm>	 (NodeTextfileStale) firing: (48) Stale textfile for cp1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[19:12:56] <wikibugs>	 (CR) CI reject: [V: -1] sre.ganeti.reimage: add new cookbook [cookbooks] - https://gerrit.wikimedia.org/r/865057 (https://phabricator.wikimedia.org/T306661) (owner: Slyngshede)
[19:13:21] <wikibugs>	 (Restored) Samtar: InitialiseSettings.php: Add oathauth-verify-user to default bureaucrat [mediawiki-config] - https://gerrit.wikimedia.org/r/835252 (https://phabricator.wikimedia.org/T265726) (owner: Samtar)
[19:13:29] <wikibugs>	 (PS2) Samtar: InitialiseSettings.php: Add oathauth-verify-user to default bureaucrat [mediawiki-config] - https://gerrit.wikimedia.org/r/835252 (https://phabricator.wikimedia.org/T265726)
[19:13:29] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T322618)', diff saved to https://phabricator.wikimedia.org/P42499 and previous config saved to /var/cache/conftool/dbconfig/20221207-191328-ladsgroup.json
[19:13:32] <stashbot>	 T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618
[19:15:14] <wikibugs>	 (CR) Dzahn: [C: +2] hiera: reorder contint1001 entries [puppet] - https://gerrit.wikimedia.org/r/865649 (owner: Hashar)
[19:16:57] <wikibugs>	 (CR) Dzahn: [C: +2] "noop confirmed" [puppet] - https://gerrit.wikimedia.org/r/865649 (owner: Hashar)
[19:18:44] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P42500 and previous config saved to /var/cache/conftool/dbconfig/20221207-191843-ladsgroup.json
[19:19:06] <wikibugs>	 (PS7) Slyngshede: sre.ganeti.reimage: add new cookbook [cookbooks] - https://gerrit.wikimedia.org/r/865057 (https://phabricator.wikimedia.org/T306661)
[19:20:38] <wikibugs>	 (CR) CI reject: [V: -1] sre.ganeti.reimage: add new cookbook [cookbooks] - https://gerrit.wikimedia.org/r/865057 (https://phabricator.wikimedia.org/T306661) (owner: Slyngshede)
[19:21:45] <wikibugs>	 (CR) Volans: "Addressed comments" [puppet] - https://gerrit.wikimedia.org/r/864729 (owner: Volans)
[19:22:07] <wikibugs>	 (PS2) Volans: cumin: add an audit report for insetup servers [puppet] - https://gerrit.wikimedia.org/r/864729
[19:22:09] <wikibugs>	 (PS1) Volans: profile::cumin: use bool2str to simplify code [puppet] - https://gerrit.wikimedia.org/r/865728
[19:22:24] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P42501 and previous config saved to /var/cache/conftool/dbconfig/20221207-192223-ladsgroup.json
[19:22:44] <wikibugs>	 (PS8) Slyngshede: sre.ganeti.reimage: add new cookbook [cookbooks] - https://gerrit.wikimedia.org/r/865057 (https://phabricator.wikimedia.org/T306661)
[19:24:37] <wikibugs>	 (CR) CI reject: [V: -1] sre.ganeti.reimage: add new cookbook [cookbooks] - https://gerrit.wikimedia.org/r/865057 (https://phabricator.wikimedia.org/T306661) (owner: Slyngshede)
[19:25:21] <wikibugs>	 (PS9) Slyngshede: sre.ganeti.reimage: add new cookbook [cookbooks] - https://gerrit.wikimedia.org/r/865057 (https://phabricator.wikimedia.org/T306661)
[19:28:00] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to wmde for Muhammad Jaziraly - https://phabricator.wikimedia.org/T324477 (KFrancis) @jhathaway, I don't have one on file, but can process one.  I'll need Muhammad Jaziraly's WMDE email address.  Please send that to kfrancis@wikimedia.org.  Thanks!
[19:33:50] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131 (T322618)', diff saved to https://phabricator.wikimedia.org/P42502 and previous config saved to /var/cache/conftool/dbconfig/20221207-193350-ladsgroup.json
[19:33:52] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance
[19:33:55] <stashbot>	 T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618
[19:34:05] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance
[19:34:09] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1165.eqiad.wmnet with reason: Maintenance
[19:34:22] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1165.eqiad.wmnet with reason: Maintenance
[19:34:23] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[19:34:39] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[19:34:45] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1165 (T322618)', diff saved to https://phabricator.wikimedia.org/P42503 and previous config saved to /var/cache/conftool/dbconfig/20221207-193445-ladsgroup.json
[19:35:53] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T322618)', diff saved to https://phabricator.wikimedia.org/P42504 and previous config saved to /var/cache/conftool/dbconfig/20221207-193553-ladsgroup.json
[19:36:45] <wikibugs>	 (CR) Herron: [C: +1] netmon: Add the netmon2002 instance as a ganeti rapi node. [puppet] - https://gerrit.wikimedia.org/r/865707 (https://phabricator.wikimedia.org/T315523) (owner: Andrea Denisse)
[19:37:30] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T322618)', diff saved to https://phabricator.wikimedia.org/P42505 and previous config saved to /var/cache/conftool/dbconfig/20221207-193730-ladsgroup.json
[19:37:32] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2169.codfw.wmnet with reason: Maintenance
[19:37:45] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2169.codfw.wmnet with reason: Maintenance
[19:37:52] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2169:3316 (T322618)', diff saved to https://phabricator.wikimedia.org/P42506 and previous config saved to /var/cache/conftool/dbconfig/20221207-193751-ladsgroup.json
[19:40:04] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316 (T322618)', diff saved to https://phabricator.wikimedia.org/P42507 and previous config saved to /var/cache/conftool/dbconfig/20221207-194003-ladsgroup.json
[19:40:07] <stashbot>	 T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618
[19:43:55] <wikibugs>	 (CR) Raymond Ndibe: webservice cli: allow for deployment of custom harbor images (1 comment) [software/tools-webservice] - https://gerrit.wikimedia.org/r/857588 (https://phabricator.wikimedia.org/T293645) (owner: Raymond Ndibe)
[19:50:38] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to wmde for Muhammad Jaziraly - https://phabricator.wikimedia.org/T324477 (jhathaway) @KFrancis, email sent, thanks!
[19:50:43] <wikibugs>	 (CR) Dzahn: "https://puppet-compiler.wmflabs.org/output/865680/38635/" [puppet] - https://gerrit.wikimedia.org/r/865680 (https://phabricator.wikimedia.org/T313832) (owner: Hashar)
[19:51:00] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P42508 and previous config saved to /var/cache/conftool/dbconfig/20221207-195100-ladsgroup.json
[19:51:05] <wikibugs>	 (CR) Dzahn: "deploying first on registry hosts, then contint old, then contint new..wip" [puppet] - https://gerrit.wikimedia.org/r/865680 (https://phabricator.wikimedia.org/T313832) (owner: Hashar)
[19:51:10] <wikibugs>	 (CR) Dzahn: [C: +2] contint: add ci::master to contint1002 [puppet] - https://gerrit.wikimedia.org/r/865680 (https://phabricator.wikimedia.org/T313832) (owner: Hashar)
[19:51:21] <wikibugs>	 (PS3) Dzahn: contint: add ci::master to contint1002 [puppet] - https://gerrit.wikimedia.org/r/865680 (https://phabricator.wikimedia.org/T313832) (owner: Hashar)
[19:53:00] <wikibugs>	 (PS1) Southparkfan: rsyslog: add support for openssl netstream driver [puppet] - https://gerrit.wikimedia.org/r/865731
[19:53:23] <wikibugs>	 (PS1) Ssingh: sites.yaml: add lvs5006 (eqsin hardware refresh) [homer/public] - https://gerrit.wikimedia.org/r/865732 (https://phabricator.wikimedia.org/T322048)
[19:53:48] <wikibugs>	 (CR) CI reject: [V: -1] rsyslog: add support for openssl netstream driver [puppet] - https://gerrit.wikimedia.org/r/865731 (owner: Southparkfan)
[19:53:55] <mutante>	 !log registry* (docker registry HA) - adding contint1002 to allowed hosts gerrit:865680 T313832
[19:53:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:53:59] <stashbot>	 T313832: contint1002 service implementation tracking - https://phabricator.wikimedia.org/T313832
[19:55:10] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316', diff saved to https://phabricator.wikimedia.org/P42509 and previous config saved to /var/cache/conftool/dbconfig/20221207-195510-ladsgroup.json
[19:56:13] <wikibugs>	 (PS2) Southparkfan: rsyslog: add support for openssl netstream driver [puppet] - https://gerrit.wikimedia.org/r/865731 (https://phabricator.wikimedia.org/T324623)
[19:56:21] <wikibugs>	 SRE, ops-eqsin, DC-Ops, Traffic, Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (RobH) In progress→Open a:RobH→None
[19:56:42] <wikibugs>	 (CR) Dzahn: "deployed on registry*, deployed on contint2002 (noop)" [puppet] - https://gerrit.wikimedia.org/r/865680 (https://phabricator.wikimedia.org/T313832) (owner: Hashar)
[19:57:01] <wikibugs>	 (CR) CI reject: [V: -1] rsyslog: add support for openssl netstream driver [puppet] - https://gerrit.wikimedia.org/r/865731 (https://phabricator.wikimedia.org/T324623) (owner: Southparkfan)
[19:57:02] <wikibugs>	 SRE, ops-eqsin, DC-Ops, Traffic, Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (RobH) a:ssingh @ssingh,  Once the final OS installations are completed please resolve this task.  Thanks!
[19:58:51] <wikibugs>	 (CR) Dzahn: [C: +2] "deployed on contint2001, contint1001 (firewall only changes)" [puppet] - https://gerrit.wikimedia.org/r/865680 (https://phabricator.wikimedia.org/T313832) (owner: Hashar)
[19:59:02] <wikibugs>	 (PS3) Southparkfan: rsyslog: add support for openssl netstream driver [puppet] - https://gerrit.wikimedia.org/r/865731 (https://phabricator.wikimedia.org/T324623)
[19:59:15] <wikibugs>	 (PS12) Ryan Kemper: add grizzly dashboard for WDQS uptime [grafana-grizzly] - https://gerrit.wikimedia.org/r/862178 (https://phabricator.wikimedia.org/T323064)
[19:59:54] <wikibugs>	 (CR) Ryan Kemper: add grizzly dashboard for WDQS uptime (3 comments) [grafana-grizzly] - https://gerrit.wikimedia.org/r/862178 (https://phabricator.wikimedia.org/T323064) (owner: Ryan Kemper)
[20:00:02] <wikibugs>	 (CR) Dzahn: [C: +2] "Antoine, it's fine with existing servers but for the new server it's missing something:" [puppet] - https://gerrit.wikimedia.org/r/865680 (https://phabricator.wikimedia.org/T313832) (owner: Hashar)
[20:00:13] <mutante>	 !log contint* - deploying firewall changes to add contint1002 - T313832
[20:00:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:00:17] <stashbot>	 T313832: contint1002 service implementation tracking - https://phabricator.wikimedia.org/T313832
[20:02:31] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "it's because these are done based on host names, best to avoid that if we can:" [puppet] - 10https://gerrit.wikimedia.org/r/865680 (https://phabricator.wikimedia.org/T313832) (owner: 10Hashar)
[20:04:49] <wikibugs>	 (03PS1) 10TrainBranchBot: group1 wikis to 1.40.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865733 (https://phabricator.wikimedia.org/T320518)
[20:04:51] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] group1 wikis to 1.40.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865733 (https://phabricator.wikimedia.org/T320518) (owner: 10TrainBranchBot)
[20:05:21] <wikibugs>	 10SRE, 10Cloud-Services, 10observability, 10Patch-For-Review, and 3 others: Switch rsyslog from gtls to ossl - https://phabricator.wikimedia.org/T324623 (10Southparkfan) I have tested https://gerrit.wikimedia.org/r/c/operations/puppet/+/865731 by using `rsyslog-openssl` on one syslog client and one syslog...
[20:05:44] <wikibugs>	 (03Merged) 10jenkins-bot: group1 wikis to 1.40.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865733 (https://phabricator.wikimedia.org/T320518) (owner: 10TrainBranchBot)
[20:06:07] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P42510 and previous config saved to /var/cache/conftool/dbconfig/20221207-200606-ladsgroup.json
[20:06:56] <wikibugs>	 (03PS1) 10Dzahn: contint: add docker::settings for contint1002 [puppet] - 10https://gerrit.wikimedia.org/r/865734 (https://phabricator.wikimedia.org/T313832)
[20:07:06] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "https://gerrit.wikimedia.org/r/c/operations/puppet/+/865734/" [puppet] - 10https://gerrit.wikimedia.org/r/865680 (https://phabricator.wikimedia.org/T313832) (owner: 10Hashar)
[20:08:28] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] contint: add docker::settings for contint1002 [puppet] - 10https://gerrit.wikimedia.org/r/865734 (https://phabricator.wikimedia.org/T313832) (owner: 10Dzahn)
[20:09:49] <logmsgbot>	 !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 14 days, 0:00:00 on contint1002.wikimedia.org with reason: new setup
[20:10:04] <logmsgbot>	 !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on contint1002.wikimedia.org with reason: new setup
[20:10:17] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316', diff saved to https://phabricator.wikimedia.org/P42511 and previous config saved to /var/cache/conftool/dbconfig/20221207-201016-ladsgroup.json
[20:13:49] <logmsgbot>	 !log demon@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.40.0-wmf.13  refs T320518
[20:13:52] <stashbot>	 T320518: 1.40.0-wmf.13 deployment blockers - https://phabricator.wikimedia.org/T320518
[20:14:03] <wikibugs>	 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T321572 (10Jclark-ctr) 05Open→03Resolved replaced optic and moved to new port
[20:16:01] <jinxer-wm>	 (CirrusSearchJobQueueBacklogTooBig) firing: (4) CirrusSearch job topic eqiad.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite is heavily backlogged with 6.211M messages - TODO  - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueBacklogTooBig
[20:18:38] <icinga-wm>	 RECOVERY - Check systemd state on an-tool1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:20:31] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+1] "confirmed no-op where it counts: https://puppet-compiler.wmflabs.org/output/865731/38636/centrallog1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/865731 (https://phabricator.wikimedia.org/T324623) (owner: 10Southparkfan)
[20:20:53] <logmsgbot>	 !log demon@deploy1002 Synchronized php: group1 wikis to 1.40.0-wmf.13  refs T320518 (duration: 07m 03s)
[20:20:58] <stashbot>	 T320518: 1.40.0-wmf.13 deployment blockers - https://phabricator.wikimedia.org/T320518
[20:21:01] <jinxer-wm>	 (CirrusSearchJobQueueBacklogTooBig) resolved: (4) CirrusSearch job topic eqiad.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite is heavily backlogged with 1.574M messages - TODO  - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueBacklogTooBig
[20:21:13] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T322618)', diff saved to https://phabricator.wikimedia.org/P42512 and previous config saved to /var/cache/conftool/dbconfig/20221207-202113-ladsgroup.json
[20:21:15] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1168.eqiad.wmnet with reason: Maintenance
[20:21:17] <stashbot>	 T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618
[20:21:28] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1168.eqiad.wmnet with reason: Maintenance
[20:21:35] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1168 (T322618)', diff saved to https://phabricator.wikimedia.org/P42513 and previous config saved to /var/cache/conftool/dbconfig/20221207-202134-ladsgroup.json
[20:22:30] <inflatador>	 ^^ anyone doing any maintenance that would explain those CirrusSearch job queue alerts?
[20:23:43] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T322618)', diff saved to https://phabricator.wikimedia.org/P42514 and previous config saved to /var/cache/conftool/dbconfig/20221207-202343-ladsgroup.json
[20:24:12] <icinga-wm>	 PROBLEM - Check systemd state on an-tool1005 is CRITICAL: CRITICAL - degraded: The following units failed: envoyproxy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:24:22] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 112 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[20:25:25] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316 (T322618)', diff saved to https://phabricator.wikimedia.org/P42515 and previous config saved to /var/cache/conftool/dbconfig/20221207-202524-ladsgroup.json
[20:25:26] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2171.codfw.wmnet with reason: Maintenance
[20:25:39] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2171.codfw.wmnet with reason: Maintenance
[20:25:46] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2171:3316 (T322618)', diff saved to https://phabricator.wikimedia.org/P42516 and previous config saved to /var/cache/conftool/dbconfig/20221207-202545-ladsgroup.json
[20:25:59] <RhinosF1>	 inflatador: when did they start? (I'm not but might help track down a related SAL entry or something)
[20:26:31] <wikibugs>	 (03PS1) 10Dzahn: ci: move docker::settings to common, avoid host names [puppet] - 10https://gerrit.wikimedia.org/r/865735 (https://phabricator.wikimedia.org/T313832)
[20:27:38] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+1] "Commit message needs some minor mending, other than that LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/865580 (https://phabricator.wikimedia.org/T324649) (owner: 10Clément Goubert)
[20:27:58] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316 (T322618)', diff saved to https://phabricator.wikimedia.org/P42517 and previous config saved to /var/cache/conftool/dbconfig/20221207-202758-ladsgroup.json
[20:28:02] <stashbot>	 T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618
[20:30:09] <wikibugs>	 (03PS5) 10Andrew Bogott: remote syslog: allow hiera config of rsyslog TLS CA [puppet] - 10https://gerrit.wikimedia.org/r/865184 (https://phabricator.wikimedia.org/T127717)
[20:31:27] <wikibugs>	 (03CR) 10Dzahn: "This does not work on a new contint master. When the ci::master role was applied now on contint1002 the contint-admins group is not create" [puppet] - 10https://gerrit.wikimedia.org/r/791565 (https://phabricator.wikimedia.org/T305729) (owner: 10Jbond)
[20:31:56] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-radar: Decommission mw13[07-48] - https://phabricator.wikimedia.org/T306162 (10jijiki)
[20:33:16] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "most things worked after the one follow-up above. We do have some remaining issues though, or at least one which comes from:" [puppet] - 10https://gerrit.wikimedia.org/r/865680 (https://phabricator.wikimedia.org/T313832) (owner: 10Hashar)
[20:34:03] <inflatador>	 RhinosF1 first alert popped around 20:16 UTC (~20m ago)
[20:34:21] <inflatador>	 I see a DB maintenance, maybe that could explain it?
[20:34:22] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "And no worries, I have confirmed jenkins, zuul and zuul-merger are dead and masked." [puppet] - 10https://gerrit.wikimedia.org/r/865680 (https://phabricator.wikimedia.org/T313832) (owner: 10Hashar)
[20:36:06] <RhinosF1>	 inflatador: DB maintenance happens near 24/7 now
[20:36:18] <RhinosF1>	 I was more thinking that the alert matched the train
[20:36:23] <inflatador>	 ah, maybe a red herring then
[20:36:39] <wikibugs>	 (03PS1) 10Effie Mouzeli: site.pp Productionise mc20[39-55] [puppet] - 10https://gerrit.wikimedia.org/r/865736 (https://phabricator.wikimedia.org/T293012)
[20:36:48] <jinxer-wm>	 (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on cloudcumin2001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[20:37:07] <RhinosF1>	 inflatador: the alert resolved, didn't it? So could it be something that was transient during the sync?
[20:37:40] <RhinosF1>	 Is there any other error on the MediaWiki side that shows why jobs might have backed up / failed / been generated more than normal?
[20:38:01] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] lvs5006: commission new LVS host (eqsin hardware refresh) [puppet] - 10https://gerrit.wikimedia.org/r/865722 (https://phabricator.wikimedia.org/T322048) (owner: 10Ssingh)
[20:38:24] <wikibugs>	 (03PS2) 10Ssingh: lvs5006: commission new LVS host (eqsin hardware refresh) [puppet] - 10https://gerrit.wikimedia.org/r/865722 (https://phabricator.wikimedia.org/T322048)
[20:38:34] <inflatador>	 RhinosF1 yeah, that's what I'm curious about myself. I found a kafka dashboard ( https://grafana-rw.wikimedia.org/d/000000234/kafka-by-topic?forceLogin&orgId=1&refresh=5m&var-datasource=eqiad%20prometheus%2Fops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=All&var-topic=eqiad.mediawiki.job.cirrusSearchElasticaWrite ) but I'm not sure it has any useful info
[20:38:39] <wikibugs>	 (03PS1) 10DDesouza: Remove Research Incentive survey from enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865737 (https://phabricator.wikimedia.org/T321930)
[20:38:50] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P42519 and previous config saved to /var/cache/conftool/dbconfig/20221207-203849-ladsgroup.json
[20:39:15] <wikibugs>	 (03CR) 10RLazarus: [C: 03+1] site.pp Productionise mc20[39-55] [puppet] - 10https://gerrit.wikimedia.org/r/865736 (https://phabricator.wikimedia.org/T293012) (owner: 10Effie Mouzeli)
[20:40:00] <wikibugs>	 (03CR) 10Herron: [C: 03+1] netmon: Set netmon2002 the main instance in codfw [puppet] - 10https://gerrit.wikimedia.org/r/865711 (https://phabricator.wikimedia.org/T315523) (owner: 10Andrea Denisse)
[20:40:19] <wikibugs>	 (03PS2) 10DDesouza: Remove Research Incentive survey from frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865737 (https://phabricator.wikimedia.org/T321930)
[20:40:26] <wikibugs>	 (03CR) 10Herron: [C: 03+1] netmon: Remove the netmon2001 instance as passive node [puppet] - 10https://gerrit.wikimedia.org/r/865695 (https://phabricator.wikimedia.org/T322695) (owner: 10Andrea Denisse)
[20:40:48] <RhinosF1>	 inflatador: not really sure where to look either. Probably a serviceops question if it's concerning to you.
[20:40:49] <wikibugs>	 (03CR) 10Herron: [C: 03+1] netmon: Remove netmon2001 from the alertmanager rw api [puppet] - 10https://gerrit.wikimedia.org/r/865693 (https://phabricator.wikimedia.org/T322695) (owner: 10Andrea Denisse)
[20:41:59] <wikibugs>	 (03PS1) 10JHathaway: Add Jennifer Hancock to datacenter-ops [puppet] - 10https://gerrit.wikimedia.org/r/865738 (https://phabricator.wikimedia.org/T324585)
[20:42:30] <inflatador>	 No worries, it's not urgent ATM
[20:43:05] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316', diff saved to https://phabricator.wikimedia.org/P42520 and previous config saved to /var/cache/conftool/dbconfig/20221207-204304-ladsgroup.json
[20:43:15] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host lvs5006.eqsin.wmnet with OS buster
[20:43:26] <wikibugs>	 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host lvs5006.eqsin.wmnet with OS buster
[20:43:43] <wikibugs>	 (03PS1) 10Dzahn: ci::master: hack to bootstrap new server contint1002 [puppet] - 10https://gerrit.wikimedia.org/r/865739 (https://phabricator.wikimedia.org/T313832)
[20:44:04] <wikibugs>	 (03PS1) 10DDesouza: Remove Research Incentive survey from swwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865740 (https://phabricator.wikimedia.org/T321252)
[20:47:26] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] ci::master: hack to bootstrap new server contint1002 [puppet] - 10https://gerrit.wikimedia.org/r/865739 (https://phabricator.wikimedia.org/T313832) (owner: 10Dzahn)
[20:47:27] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes102[34] - https://phabricator.wikimedia.org/T313873 (10Cmjohnson)
[20:47:45] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes102[34] - https://phabricator.wikimedia.org/T313873 (10Cmjohnson) 05Open→03Resolved completed
[20:48:28] <wikibugs>	 (03PS1) 10Ssingh: lvs5006: set as secondary LVS and remove lvs5003 (decomm) [puppet] - 10https://gerrit.wikimedia.org/r/865742 (https://phabricator.wikimedia.org/T323830)
[20:49:31] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] lvs5006: set as secondary LVS and remove lvs5003 (decomm) [puppet] - 10https://gerrit.wikimedia.org/r/865742 (https://phabricator.wikimedia.org/T323830) (owner: 10Ssingh)
[20:50:11] <wikibugs>	 (03PS1) 10Dzahn: Revert "ci::master: hack to bootstrap new server contint1002" [puppet] - 10https://gerrit.wikimedia.org/r/865525
[20:50:42] <wikibugs>	 (03PS1) 10DDesouza: Deploy Research Incentive survey on yowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865744 (https://phabricator.wikimedia.org/T321249)
[20:51:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job trafficserver-text in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[20:52:23] <wikibugs>	 (03CR) 10Ssingh: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/865742 (https://phabricator.wikimedia.org/T323830) (owner: 10Ssingh)
[20:53:56] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P42521 and previous config saved to /var/cache/conftool/dbconfig/20221207-205356-ladsgroup.json
[20:55:53] <wikibugs>	 (03PS5) 10Ottomata: flink-kubernetes-operator - modify for WMF [deployment-charts] - 10https://gerrit.wikimedia.org/r/865158 (https://phabricator.wikimedia.org/T324576)
[20:58:11] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316', diff saved to https://phabricator.wikimedia.org/P42522 and previous config saved to /var/cache/conftool/dbconfig/20221207-205811-ladsgroup.json
[20:58:27] <wikibugs>	 (03CR) 10Ottomata: "> I would also argue not to remove things from the chart that can just stay disabled/unused" [deployment-charts] - 10https://gerrit.wikimedia.org/r/865158 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata)
[20:58:37] <wikibugs>	 (03PS6) 10Ottomata: flink-kubernetes-operator - modify for WMF and add an admin_ng helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/865158 (https://phabricator.wikimedia.org/T324576)
[20:59:07] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 137 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[21:00:04] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: gettimeofday() says it's time for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221207T2100)
[21:00:04] <jouncebot>	 duesen and danisztls: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[21:00:10] <danisztls>	 o/
[21:01:36] <duesen>	 o/
[21:01:42] <wikibugs>	 (03CR) 10RLazarus: [C: 03+1] add grizzly dashboard for WDQS uptime [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/862178 (https://phabricator.wikimedia.org/T323064) (owner: 10Ryan Kemper)
[21:02:10] <mutante>	 !log contint1002 a2dismod mpm_event  - https://phabricator.wikimedia.org/T208108 Bug: T313832
[21:02:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:02:15] <stashbot>	 T313832: contint1002 service implementation tracking - https://phabricator.wikimedia.org/T313832
[21:03:01] <TheresNoTime>	 o/ I can deploy
[21:03:39] <duesen>	 TheresNoTime: awesome :) I can also self service, but tbh, it's late, and I have had a bit of a day...
[21:03:59] <TheresNoTime>	 duesen: no worries :D where were you wanting https://gerrit.wikimedia.org/r/c/mediawiki/extensions/DiscussionTools/+/864838/ deployed, assuming a backport..?
[21:04:12] <wikibugs>	 (03PS3) 10Samtar: hewiki: enable parser cache writes for parsoid's page/html endpoint. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865070 (https://phabricator.wikimedia.org/T322672) (owner: 10Daniel Kinzler)
[21:04:23] <duesen>	 TheresNoTime: oh crud yes, i didn't cherry-pick. give me a sec
[21:05:53] <wikibugs>	 (03PS3) 10Samtar: Page 5% of calls to parsoid's page/html endpoint write to PC [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865071 (https://phabricator.wikimedia.org/T322672) (owner: 10Daniel Kinzler)
[21:05:57] <duesen>	 TheresNoTime: hm, the cherry pick failed. can you do the config patches first? they can go in together, at the same time
[21:06:07] <TheresNoTime>	 duesen: sure, will do now
[21:06:09] <duesen>	 I'll figure out what's up with the DiscussionTools patch
[21:06:50] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865070 (https://phabricator.wikimedia.org/T322672) (owner: 10Daniel Kinzler)
[21:06:52] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865071 (https://phabricator.wikimedia.org/T322672) (owner: 10Daniel Kinzler)
[21:07:37] <wikibugs>	 (03Merged) 10jenkins-bot: hewiki: enable parser cache writes for parsoid's page/html endpoint. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865070 (https://phabricator.wikimedia.org/T322672) (owner: 10Daniel Kinzler)
[21:07:39] <wikibugs>	 (03PS1) 10Dzahn: Revert "contint: give RelEng access to contint1002" [puppet] - 10https://gerrit.wikimedia.org/r/865746
[21:07:41] <wikibugs>	 (03Merged) 10jenkins-bot: Page 5% of calls to parsoid's page/html endpoint write to PC [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865071 (https://phabricator.wikimedia.org/T322672) (owner: 10Daniel Kinzler)
[21:07:57] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs5006.eqsin.wmnet with reason: host reimage
[21:08:04] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Revert "contint: give RelEng access to contint1002" [puppet] - 10https://gerrit.wikimedia.org/r/865746 (owner: 10Dzahn)
[21:08:12] <logmsgbot>	 !log samtar@deploy1002 Started scap: Backport for [[gerrit:865070|hewiki: enable parser cache writes for parsoid's page/html endpoint. (T322672 T320534 T320529)]], [[gerrit:865071|Page 5% of calls to parsoid's page/html endpoint write to PC (T322672)]]
[21:08:18] <stashbot>	 T320529: Configure VE backend to use Parsoid directly, instead of calling RESTbase - https://phabricator.wikimedia.org/T320529
[21:08:18] <stashbot>	 T320534: Put Parsoid output into the ParserCache on every edit - https://phabricator.wikimedia.org/T320534
[21:08:18] <stashbot>	 T322672: Make ParsoidHandler::wt2html write to parser cache - https://phabricator.wikimedia.org/T322672
[21:09:03] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T322618)', diff saved to https://phabricator.wikimedia.org/P42523 and previous config saved to /var/cache/conftool/dbconfig/20221207-210902-ladsgroup.json
[21:09:04] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1180.eqiad.wmnet with reason: Maintenance
[21:09:06] <stashbot>	 T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618
[21:09:17] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1180.eqiad.wmnet with reason: Maintenance
[21:09:24] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1180 (T322618)', diff saved to https://phabricator.wikimedia.org/P42524 and previous config saved to /var/cache/conftool/dbconfig/20221207-210923-ladsgroup.json
[21:10:05] <logmsgbot>	 !log samtar@deploy1002 samtar and daniel: Backport for [[gerrit:865070|hewiki: enable parser cache writes for parsoid's page/html endpoint. (T322672 T320534 T320529)]], [[gerrit:865071|Page 5% of calls to parsoid's page/html endpoint write to PC (T322672)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet
[21:10:46] <wikibugs>	 (03PS2) 10Dzahn: Revert "contint: give RelEng access to contint1002" [puppet] - 10https://gerrit.wikimedia.org/r/865746
[21:10:52] <TheresNoTime>	 duesen: those patches are live on mwdebug, but just FYI I'm looking at T324711, a lot of busy exception logs
[21:10:52] <stashbot>	 T324711: UnexpectedValueException: Parsoid does not support content model proofread-index - https://phabricator.wikimedia.org/T324711
[21:11:07] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs5006.eqsin.wmnet with reason: host reimage
[21:11:10] <TheresNoTime>	 (unrelated to yours, just worrying :D)
[21:11:11] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] Revert "contint: give RelEng access to contint1002" [puppet] - 10https://gerrit.wikimedia.org/r/865746 (owner: 10Dzahn)
[21:11:32] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T322618)', diff saved to https://phabricator.wikimedia.org/P42525 and previous config saved to /var/cache/conftool/dbconfig/20221207-211132-ladsgroup.json
[21:11:47] <duesen>	 TheresNoTime: nvm the DiscussionTools patch, it's already on the branch, it got merged before the branch cut 
[21:11:52] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Revert "contint: give RelEng access to contint1002" [puppet] - 10https://gerrit.wikimedia.org/r/865746 (owner: 10Dzahn)
[21:12:00] <icinga-wm>	 PROBLEM - Disk space on mx1001 is CRITICAL: DISK CRITICAL - free space: / 706 MB (3% inode=91%): /tmp 706 MB (3% inode=91%): /var/tmp 706 MB (3% inode=91%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mx1001&var-datasource=eqiad+prometheus/ops
[21:12:03] <wikibugs>	 (03CR) 10Hashar: [C: 03+1] "Paired with Daniel.  That was my mistake, I made a first change to get shell access which only added contint-root, the next change adding " [puppet] - 10https://gerrit.wikimedia.org/r/865746 (owner: 10Dzahn)
[21:12:21] <TheresNoTime>	 duesen: ack okay — can you test those config patches?
[21:12:38] <duesen>	 TheresNoTime: I'll try to test the config patch on debug, though I don't think I'll be able to see much. 
[21:12:39] <wikibugs>	 (03PS7) 10Ottomata: flink-kubernetes-operator - modify for WMF and add an admin_ng helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/865158 (https://phabricator.wikimedia.org/T324576)
[21:12:55] <wikibugs>	 (03PS3) 10Dzahn: Revert "contint: give RelEng access to contint1002" [puppet] - 10https://gerrit.wikimedia.org/r/865746
[21:13:09] <wikibugs>	 (03CR) 10Dzahn: [V: 03+2] Revert "contint: give RelEng access to contint1002" [puppet] - 10https://gerrit.wikimedia.org/r/865746 (owner: 10Dzahn)
[21:13:18] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316 (T322618)', diff saved to https://phabricator.wikimedia.org/P42526 and previous config saved to /var/cache/conftool/dbconfig/20221207-211317-ladsgroup.json
[21:13:19] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2180.codfw.wmnet with reason: Maintenance
[21:13:33] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2180.codfw.wmnet with reason: Maintenance
[21:13:39] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] Revert "ci::master: hack to bootstrap new server contint1002" [puppet] - 10https://gerrit.wikimedia.org/r/865525 (owner: 10Dzahn)
[21:13:39] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2180 (T322618)', diff saved to https://phabricator.wikimedia.org/P42527 and previous config saved to /var/cache/conftool/dbconfig/20221207-211338-ladsgroup.json
[21:14:25] <wikibugs>	 (03CR) 10Ottomata: flink-kubernetes-operator - modify for WMF and add an admin_ng helmfile (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/865158 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata)
[21:14:50] <wikibugs>	 (03CR) 10Dzahn: "You can ignore my comments here. We found the _actual_ cause of the issue and it wasn't this :)" [puppet] - 10https://gerrit.wikimedia.org/r/791565 (https://phabricator.wikimedia.org/T305729) (owner: 10Jbond)
[21:15:51] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T322618)', diff saved to https://phabricator.wikimedia.org/P42528 and previous config saved to /var/cache/conftool/dbconfig/20221207-211551-ladsgroup.json
[21:15:55] <stashbot>	 T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618
[21:17:11] <duesen>	 TheresNoTime: everything looks fine. Whether it actually is, we'll know once restbase starts hitting it.
[21:17:49] <duesen>	 TheresNoTime: if you merge them, i'll keep an eye on the metrics
[21:18:01] <wikibugs>	 10SRE, 10ops-codfw: codfw:test new Supermicro server - https://phabricator.wikimedia.org/T322578 (10Papaul) First OS install was done with the first 2 ssd's in software raid 1 and was able to see the Nvme as well ` Disk /dev/nvme0n1: 5.82 TiB, 6401252745216 bytes, 1562805846 sectors Disk model: WUS4CB064D7P3E3...
[21:18:27] <TheresNoTime>	 duesen: okay - I'm a little concerned with T324711, not entirely sure if I should merge while we're seeing that many exceptions post-train
[21:18:27] <stashbot>	 T324711: UnexpectedValueException: Parsoid does not support content model proofread-index - https://phabricator.wikimedia.org/T324711
[21:18:35] <duesen>	 TheresNoTime: I will start to look at T324711 as well. May be related to my work (not the backport patches though). 
[21:20:59] <TheresNoTime>	 duesen: should I merge the config patches, or would you prefer to address that first?
[21:22:39] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] gerrit: raise H2 compaction time [puppet] - 10https://gerrit.wikimedia.org/r/865023 (https://phabricator.wikimedia.org/T323754) (owner: 10Hashar)
[21:22:43] <duesen>	 TheresNoTime: please merge the config patches
[21:22:51] <TheresNoTime>	 ack
[21:24:12] <wikibugs>	 (03PS1) 10Stang: specieswiki: Install GeoData extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865766 (https://phabricator.wikimedia.org/T324348)
[21:25:16] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10ops-codfw, 10Patch-For-Review: Access request for datacenter-ops group - https://phabricator.wikimedia.org/T324585 (10jhathaway) @wiki_willy the datacenter-ops group is a local group which grants access to a number of sudo commands needed for datacenter work. The [[ https://...
[21:25:29] <cirno>	 Hi TheresNoTime, would you mind taking care of one more patch ^^
[21:26:02] <TheresNoTime>	 cirno: sure, there's one more ahead of you
[21:26:33] <wikibugs>	 (03PS3) 10Samtar: Remove Research Incentive survey from frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865737 (https://phabricator.wikimedia.org/T321930) (owner: 10DDesouza)
[21:26:35] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10ops-codfw, 10Patch-For-Review: Access request for datacenter-ops group - https://phabricator.wikimedia.org/T324585 (10wiki_willy) Yup, that's correct.  Thanks @jhathaway!  >>! In T324585#8452314, @jhathaway wrote: > @wiki_willy the datacenter-ops group is a local group which...
[21:26:39] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P42529 and previous config saved to /var/cache/conftool/dbconfig/20221207-212638-ladsgroup.json
[21:27:16] <cirno>	 I have added this one on the board. TheresNoTime: this patch requires https://gerrit.wikimedia.org/r/863442/, could you please have a look and give a +2?
[21:27:46] <wikibugs>	 (03PS2) 10JHathaway: Add Jennifer Hancock to the ops group [puppet] - 10https://gerrit.wikimedia.org/r/865738 (https://phabricator.wikimedia.org/T324585)
[21:27:57] <TheresNoTime>	 cirno: looking
[21:28:00] <wikibugs>	 (03CR) 10Hashar: [V: 04-1] "We can dig in the history, but I think the partition name is local to the host. That comes from when we migrated from /mnt to /srv Ic0c805" [puppet] - 10https://gerrit.wikimedia.org/r/865735 (https://phabricator.wikimedia.org/T313832) (owner: 10Dzahn)
[21:28:47] <logmsgbot>	 !log samtar@deploy1002 Finished scap: Backport for [[gerrit:865070|hewiki: enable parser cache writes for parsoid's page/html endpoint. (T322672 T320534 T320529)]], [[gerrit:865071|Page 5% of calls to parsoid's page/html endpoint write to PC (T322672)]] (duration: 20m 35s)
[21:28:53] <stashbot>	 T320529: Configure VE backend to use Parsoid directly, instead of calling RESTbase - https://phabricator.wikimedia.org/T320529
[21:28:54] <stashbot>	 T320534: Put Parsoid output into the ParserCache on every edit - https://phabricator.wikimedia.org/T320534
[21:28:54] <stashbot>	 T322672: Make ParsoidHandler::wt2html write to parser cache - https://phabricator.wikimedia.org/T322672
[21:29:06] <TheresNoTime>	 duesen: those config patches should be live now
[21:29:24] <TheresNoTime>	 danisztls: doing 865737 now
[21:29:30] <danisztls>	 TheresNoTime: Thanks!
[21:29:40] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865737 (https://phabricator.wikimedia.org/T321930) (owner: 10DDesouza)
[21:30:28] <wikibugs>	 (03Merged) 10jenkins-bot: Remove Research Incentive survey from frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865737 (https://phabricator.wikimedia.org/T321930) (owner: 10DDesouza)
[21:30:52] <dancy>	 TheresNoTime, I think I'm going to roll back the train when you're done.
[21:30:55] <logmsgbot>	 !log samtar@deploy1002 Started scap: Backport for [[gerrit:865737|Remove Research Incentive survey from frwiki (T321930)]]
[21:30:58] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P42530 and previous config saved to /var/cache/conftool/dbconfig/20221207-213057-ladsgroup.json
[21:30:58] <stashbot>	 T321930: Deploy Research Incentive Survey targeting Sub-Saharan Africa on French Wikipedia - https://phabricator.wikimedia.org/T321930
[21:30:59] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - sukhe@cumin2002"
[21:31:10] <TheresNoTime>	 dancy: ack
[21:31:44] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10ops-codfw, 10Patch-For-Review: Access request for datacenter-ops group - https://phabricator.wikimedia.org/T324585 (10jhathaway) 05Open→03Resolved a:03jhathaway done!
[21:32:07] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - sukhe@cumin2002"
[21:32:08] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs5006.eqsin.wmnet with OS buster
[21:32:18] <wikibugs>	 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host lvs5006.eqsin.wmnet with OS buster completed: - lvs5006 (**PASS**)...
[21:32:47] <logmsgbot>	 !log samtar@deploy1002 samtar and dani: Backport for [[gerrit:865737|Remove Research Incentive survey from frwiki (T321930)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet
[21:32:50] <TheresNoTime>	 danisztls: that's live on mwdebug now, can you test?
[21:32:59] <danisztls>	 TheresNoTime: yes
[21:33:07] <danisztls>	 any mwdebug?
[21:33:14] <TheresNoTime>	 any :)
[21:33:37] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] sites.yaml: add lvs5006 (eqsin hardware refresh) [homer/public] - 10https://gerrit.wikimedia.org/r/865732 (https://phabricator.wikimedia.org/T322048) (owner: 10Ssingh)
[21:33:48] <danisztls>	 TheresNoTime: it looks fine
[21:33:48] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Wenjun Fan - https://phabricator.wikimedia.org/T319056 (10jhathaway) @AnnWF is this now a duplicate, since you were added to analytics_privatedata_users in https://phabricator.wikimedia.org/T324057?
[21:33:56] <TheresNoTime>	 syncing
[21:34:51] <sukhe>	 !log homer "cr*-eqsin*" commit "running homer for Gerrit: 865742"
[21:34:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:35:52] <wikibugs>	 (03PS2) 10Samtar: specieswiki: Install GeoData extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865766 (https://phabricator.wikimedia.org/T324348) (owner: 10Stang)
[21:36:06] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 4:00:00 on lvs5003.eqsin.wmnet with reason: downtimed, in the process of decom
[21:36:32] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on lvs5003.eqsin.wmnet with reason: downtimed, in the process of decom
[21:36:55] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.decommission for hosts lvs5003.eqsin.wmnet
[21:38:34] <wikibugs>	 (03PS1) 10Ssingh: sites.yaml: remove decommissioned host lvs5003 [homer/public] - 10https://gerrit.wikimedia.org/r/865773 (https://phabricator.wikimedia.org/T323830)
[21:39:59] <logmsgbot>	 !log samtar@deploy1002 Finished scap: Backport for [[gerrit:865737|Remove Research Incentive survey from frwiki (T321930)]] (duration: 09m 04s)
[21:40:02] <TheresNoTime>	 danisztls: that should be live now :)
[21:40:02] <stashbot>	 T321930: Deploy Research Incentive Survey targeting Sub-Saharan Africa on French Wikipedia - https://phabricator.wikimedia.org/T321930
[21:40:15] <TheresNoTime>	 cirno: doing 865766 now
[21:40:20] <danisztls>	 TheresNoTime: thanks
[21:40:36] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865766 (https://phabricator.wikimedia.org/T324348) (owner: 10Stang)
[21:40:41] <duesen>	 TheresNoTime: subbu  and scott and I are looking into the bug. i have a good idea what it is. but not how to fix it, really
[21:40:55] <TheresNoTime>	 :((
[21:41:08] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] "lgtm!  Arnold, do you wanna merge it, watch what puppet adds and test starting that new systemd unit it creates? might be interesting for " [puppet] - 10https://gerrit.wikimedia.org/r/865674 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff)
[21:41:20] <wikibugs>	 (03Merged) 10jenkins-bot: specieswiki: Install GeoData extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865766 (https://phabricator.wikimedia.org/T324348) (owner: 10Stang)
[21:41:20] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.dns.netbox
[21:41:44] <wikibugs>	 (03PS3) 10Dzahn: phabricator: rm code from before system user was created with systemd [puppet] - 10https://gerrit.wikimedia.org/r/865208
[21:41:45] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P42532 and previous config saved to /var/cache/conftool/dbconfig/20221207-214145-ladsgroup.json
[21:41:46] <logmsgbot>	 !log samtar@deploy1002 Started scap: Backport for [[gerrit:865766|specieswiki: Install GeoData extension (T324348)]]
[21:41:50] <stashbot>	 T324348: Add Extension:GeoData to Wikispecies wiki - https://phabricator.wikimedia.org/T324348
[21:42:01] <cirno>	 TheresNoTime: have you run the script createExtensionTables to create tables?
[21:42:50] <TheresNoTime>	 cirno: nope.. will do now!
[21:43:20] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: lvs5003.eqsin.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin2002"
[21:43:39] <logmsgbot>	 !log samtar@deploy1002 samtar and stang: Backport for [[gerrit:865766|specieswiki: Install GeoData extension (T324348)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet
[21:43:51] <wikibugs>	 (03PS1) 10Brion VIBBER: Use blubber via Docker tooling; no longer requires local binary [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/865779
[21:44:30] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: lvs5003.eqsin.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin2002"
[21:44:30] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[21:44:31] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts lvs5003.eqsin.wmnet
[21:44:39] <wikibugs>	 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q2:rack/setup/install/decom eqsin: unified decommission task - https://phabricator.wikimedia.org/T323830 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by sukhe@cumin2002 for hosts: `lvs5003.eqsin.wmnet` - lvs5003.eqsin.wmnet...
[21:44:42] <wikibugs>	 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q2:rack/setup/install/decom eqsin: unified decommission task - https://phabricator.wikimedia.org/T323830 (10ssingh)
[21:44:51] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] lvs5006: set as secondary LVS and remove lvs5003 (decomm) [puppet] - 10https://gerrit.wikimedia.org/r/865742 (https://phabricator.wikimedia.org/T323830) (owner: 10Ssingh)
[21:45:00] <TheresNoTime>	 cirno: one moment
[21:45:53] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] sites.yaml: remove decommissioned host lvs5003 [homer/public] - 10https://gerrit.wikimedia.org/r/865773 (https://phabricator.wikimedia.org/T323830) (owner: 10Ssingh)
[21:46:04] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P42533 and previous config saved to /var/cache/conftool/dbconfig/20221207-214603-ladsgroup.json
[21:47:18] <sukhe>	 !log homer "cr*-eqsin*" commit "running homer for Gerrit: 865773"
[21:47:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:47:46] <TheresNoTime>	 cirno: I will need to backport https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaMaintenance/+/863442, may be worth rescheduling your patch?
[21:48:30] <TheresNoTime>	 or will I..?
[21:48:34] <wikibugs>	 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (10ssingh)
[21:48:38] <cirno>	 it's ok, I could re-schedule the extension install patch
[21:49:05] <cirno>	 or should I do a backport of WikimediaMaintenance?
[21:49:15] <cirno>	 (I mean, by myself)
[21:49:29] <logmsgbot>	 !log samtar@deploy1002 Sync cancelled.
[21:50:27] <TheresNoTime>	 cirno: let's reschedule, given there's also T324711 going on and dancy wants to roll back the train. I'll revert that patch I merged
[21:50:28] <stashbot>	 T324711: UnexpectedValueException: Parsoid does not support content model proofread-index - https://phabricator.wikimedia.org/T324711
[21:50:57] <wikibugs>	 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (10ssingh) 05Open→03Resolved Thanks to @RobH, @Papaul, @Bblack, @cmooney, @MoritzMuehlenhoff, @Volans for all their help in the eqsin refresh.
[21:51:17] <logmsgbot>	 !log samtar@deploy1002 backport aborted:  (duration: 00m 15s)
[21:51:55] <cirno>	 TheresNoTime: got it and agree to postpone, what do you think is a good time for the next schedule?
[21:52:03] <wikibugs>	 (03PS1) 10Samtar: Revert "specieswiki: Install GeoData extension" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865747
[21:54:52] <wikibugs>	 (03CR) 10Samtar: [C: 03+2] Revert "specieswiki: Install GeoData extension" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865747 (owner: 10Samtar)
[21:55:37] <TheresNoTime>	 cirno: as long as that WikimediaMaintenance change is available - next window maybe?
[21:56:07] <TheresNoTime>	 dancy: done, all yours
[21:56:14] <dancy>	 thx!
[21:56:20] <TheresNoTime>	 !log UTC late backport window done
[21:56:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:56:52] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T322618)', diff saved to https://phabricator.wikimedia.org/P42534 and previous config saved to /var/cache/conftool/dbconfig/20221207-215651-ladsgroup.json
[21:56:53] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1187.eqiad.wmnet with reason: Maintenance
[21:56:55] <stashbot>	 T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618
[21:57:06] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "specieswiki: Install GeoData extension" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865747 (owner: 10Samtar)
[21:57:06] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1187.eqiad.wmnet with reason: Maintenance
[21:57:13] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1187 (T322618)', diff saved to https://phabricator.wikimedia.org/P42535 and previous config saved to /var/cache/conftool/dbconfig/20221207-215712-ladsgroup.json
[21:57:18] <wikibugs>	 (03PS1) 10Stang: createExtensionTables: Add extension GeoData [extensions/WikimediaMaintenance] (wmf/1.40.0-wmf.13) - 10https://gerrit.wikimedia.org/r/865748 (https://phabricator.wikimedia.org/T324348)
[21:59:21] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T322618)', diff saved to https://phabricator.wikimedia.org/P42536 and previous config saved to /var/cache/conftool/dbconfig/20221207-215921-ladsgroup.json
[22:01:11] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T322618)', diff saved to https://phabricator.wikimedia.org/P42537 and previous config saved to /var/cache/conftool/dbconfig/20221207-220110-ladsgroup.json
[22:09:11] <icinga-wm>	 PROBLEM - ensure kvm processes are running on cloudvirt1019 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[22:11:41] <icinga-wm>	 RECOVERY - ensure kvm processes are running on cloudvirt1019 is OK: PROCS OK: 5 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[22:13:22] <wikibugs>	 (03PS2) 10Ryan Kemper: wdqs: Bring wdqs20[09,10,11,12] online [puppet] - 10https://gerrit.wikimedia.org/r/862369 (https://phabricator.wikimedia.org/T301167) (owner: 10Bking)
[22:14:06] <wikibugs>	 (03CR) 10Ryan Kemper: [C: 03+1] wdqs: Bring wdqs20[09,10,11,12] online [puppet] - 10https://gerrit.wikimedia.org/r/862369 (https://phabricator.wikimedia.org/T301167) (owner: 10Bking)
[22:14:28] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P42538 and previous config saved to /var/cache/conftool/dbconfig/20221207-221427-ladsgroup.json
[22:14:44] <wikibugs>	 (03CR) 10Bking: [C: 03+2] wdqs: Bring wdqs20[09,10,11,12] online [puppet] - 10https://gerrit.wikimedia.org/r/862369 (https://phabricator.wikimedia.org/T301167) (owner: 10Bking)
[22:23:41] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer
[22:25:01] <logmsgbot>	 !log bking@cumin2002 END (ERROR) - Cookbook sre.wdqs.data-transfer (exit_code=97)
[22:25:18] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer
[22:25:37] <duesen>	 TheresNoTime: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/865785 is the fix we want, never mind the other one for now.
[22:25:44] <duesen>	 TheresNoTime: having both doesn't hurt.
[22:26:33] <logmsgbot>	 !log bking@cumin2002 END (ERROR) - Cookbook sre.wdqs.data-transfer (exit_code=97)
[22:26:48] <TheresNoTime>	 ah :D
[22:28:59] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.wdqs.data-reload
[22:29:00] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99)
[22:29:34] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P42539 and previous config saved to /var/cache/conftool/dbconfig/20221207-222934-ladsgroup.json
[22:29:54] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.wdqs.data-reload
[22:29:54] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99)
[22:30:01] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.wdqs.data-reload
[22:30:25] <TheresNoTime>	 duesen: guessing you're going to want 865785 backported?
[22:32:12] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99)
[22:32:33] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[22:33:34] <duesen>	 TheresNoTime: yes, please. 
[22:34:16] * TheresNoTime is available to do that unless anyone else would prefer to?
[22:34:46] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 110 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[22:34:57] <duesen>	 TheresNoTime: if you could do it, that would really help. I'm in zombie mode at this point. Need to sleep. I hope subbu can help in case something goes wrong.
[22:35:20] <TheresNoTime>	 sure :)
[22:35:30] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.wdqs.data-reload
[22:35:34] <subbu>	 worst case, we roll back train to group 0.
[22:36:01] <logmsgbot>	 !log ryankemper@puppetmaster1001 conftool action : set/weight=10:pooled=no; selector: name=wdqs2010.*
[22:36:17] <logmsgbot>	 !log ryankemper@puppetmaster1001 conftool action : set/weight=10:pooled=no; selector: name=wdqs2009.*
[22:36:42] <icinga-wm>	 PROBLEM - WDQS SPARQL on wdqs2011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string http://www.w3.org/2001/XML... not found on https://query.wikidata.org:443/bigdata/namespace/wdq/sparql?query=SELECT%20*%20WHERE%20%7Bwikibase%3ADump%20schema%3AdateModified%20%3Fy%7D%20LIMIT%201 - 533 bytes in 1.225 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[22:37:43] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99)
[22:39:54] <icinga-wm>	 PROBLEM - Blazegraph Port for wdqs-categories on wdqs2012 is CRITICAL: connect to address 127.0.0.1 and port 9990: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[22:40:46] <wikibugs>	 (03PS1) 10Samtar: Make parsoid accept all content models. [core] (wmf/1.40.0-wmf.13) - 10https://gerrit.wikimedia.org/r/865749 (https://phabricator.wikimedia.org/T324711)
[22:41:15] <wikibugs>	 (03PS1) 10Bking: wdqs data-reload.py: fix usage comment [cookbooks] - 10https://gerrit.wikimedia.org/r/865788
[22:41:40] <ryankemper>	 !log T301167 Downtimed `wdqs20[09-12]` for 7 days
[22:41:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:41:44] <stashbot>	 T301167: Service implementation for wdqs20[09,10,11,12] - https://phabricator.wikimedia.org/T301167
[22:42:53] <wikibugs>	 (03CR) 10Samtar: [C: 03+2] "Backporting" [core] (wmf/1.40.0-wmf.13) - 10https://gerrit.wikimedia.org/r/865749 (https://phabricator.wikimedia.org/T324711) (owner: 10Samtar)
[22:43:16] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] wdqs data-reload.py: fix usage comment [cookbooks] - 10https://gerrit.wikimedia.org/r/865788 (owner: 10Bking)
[22:44:41] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T322618)', diff saved to https://phabricator.wikimedia.org/P42540 and previous config saved to /var/cache/conftool/dbconfig/20221207-224440-ladsgroup.json
[22:44:42] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1201.eqiad.wmnet with reason: Maintenance
[22:44:45] <stashbot>	 T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618
[22:44:56] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1201.eqiad.wmnet with reason: Maintenance
[22:45:02] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1201 (T322618)', diff saved to https://phabricator.wikimedia.org/P42541 and previous config saved to /var/cache/conftool/dbconfig/20221207-224502-ladsgroup.json
[22:46:11] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1201 (T322618)', diff saved to https://phabricator.wikimedia.org/P42542 and previous config saved to /var/cache/conftool/dbconfig/20221207-224610-ladsgroup.json
[22:47:01] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.wdqs.data-reload
[22:48:31] <TheresNoTime>	 !log Going to backport [[gerrit:865749]] to wmf/1.40.0-wmf.13 for T324711
[22:48:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:48:35] <stashbot>	 T324711: UnexpectedValueException: Parsoid does not support content model proofread-index - https://phabricator.wikimedia.org/T324711
[22:48:40] <logmsgbot>	 !log bking@cumin2002 END (ERROR) - Cookbook sre.wdqs.data-reload (exit_code=97)
[22:49:33] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.wdqs.data-reload
[22:49:35] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99)
[22:49:54] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.wdqs.data-reload
[22:49:55] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99)
[22:49:56] <icinga-wm>	 RECOVERY - Blazegraph Port for wdqs-categories on wdqs2012 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9990 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[22:50:06] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.wdqs.data-reload
[22:51:26] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.wdqs.data-reload
[22:51:28] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99)
[22:51:46] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.wdqs.data-reload
[22:54:39] <wikibugs>	 (03PS2) 10Bking: wdqs data-reload.py: fix usage comment [cookbooks] - 10https://gerrit.wikimedia.org/r/865788
[22:54:44] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:55:15] <wikibugs>	 (03PS3) 10Bking: wdqs data-reload.py: fix usage comment [cookbooks] - 10https://gerrit.wikimedia.org/r/865788
[22:55:21] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] doc: Enable profile::auto_restarts::service for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/865646 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff)
[22:55:52] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 128 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[22:56:28] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:56:55] <wikibugs>	 (03PS1) 10RLazarus: Refactor: Migrate from attrs to dataclasses [software/httpbb] - 10https://gerrit.wikimedia.org/r/865789
[22:56:57] <wikibugs>	 (03PS1) 10RLazarus: Refactor: Wrap verify_certs inside an Options type. [software/httpbb] - 10https://gerrit.wikimedia.org/r/865790 (https://phabricator.wikimedia.org/T323707)
[22:56:59] <wikibugs>	 (03PS1) 10RLazarus: Add an option, off by default, to retry once when a request times out. [software/httpbb] - 10https://gerrit.wikimedia.org/r/865791 (https://phabricator.wikimedia.org/T323707)
[22:58:14] <wikibugs>	 (03CR) 10Dzahn: "Arnold, if you are here tomorrow, maybe you can chat with Antoine (hashar) and merge this for him when he says it's ready to go?" [puppet] - 10https://gerrit.wikimedia.org/r/865681 (https://phabricator.wikimedia.org/T313832) (owner: 10Hashar)
[22:58:28] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Refactor: Wrap verify_certs inside an Options type. [software/httpbb] - 10https://gerrit.wikimedia.org/r/865790 (https://phabricator.wikimedia.org/T323707) (owner: 10RLazarus)
[22:58:30] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Add an option, off by default, to retry once when a request times out. [software/httpbb] - 10https://gerrit.wikimedia.org/r/865791 (https://phabricator.wikimedia.org/T323707) (owner: 10RLazarus)
[22:58:32] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Refactor: Migrate from attrs to dataclasses [software/httpbb] - 10https://gerrit.wikimedia.org/r/865789 (owner: 10RLazarus)
[22:58:45] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] phabricator: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/860905 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff)
[22:58:52] <wikibugs>	 (03PS3) 10Dzahn: phabricator: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/860905 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff)
[22:59:55] <wikibugs>	 (03Merged) 10jenkins-bot: Make parsoid accept all content models. [core] (wmf/1.40.0-wmf.13) - 10https://gerrit.wikimedia.org/r/865749 (https://phabricator.wikimedia.org/T324711) (owner: 10Samtar)
[23:00:07] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [core] (wmf/1.40.0-wmf.13) - 10https://gerrit.wikimedia.org/r/865749 (https://phabricator.wikimedia.org/T324711) (owner: 10Samtar)
[23:00:48] <logmsgbot>	 !log samtar@deploy1002 Started scap: Backport for [[gerrit:865749|Make parsoid accept all content models. (T324711)]]
[23:00:52] <stashbot>	 T324711: UnexpectedValueException: Parsoid does not support content model proofread-index - https://phabricator.wikimedia.org/T324711
[23:01:17] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1201', diff saved to https://phabricator.wikimedia.org/P42543 and previous config saved to /var/cache/conftool/dbconfig/20221207-230116-ladsgroup.json
[23:01:18] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 102 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[23:02:45] <logmsgbot>	 !log samtar@deploy1002 samtar and samtar: Backport for [[gerrit:865749|Make parsoid accept all content models. (T324711)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet
[23:02:58] * TheresNoTime is testing
[23:08:12] <subbu>	 I still see the same errors: "/srv/mediawiki/php-1.40.0-wmf.13/includes/parser/Parsoid/ParsoidOutputAccess.php:196" ... but, on master, that isn't the right line anymore.
[23:08:26] <subbu>	 oh, testservers only .. never mind. ignore me.
[23:09:29] <subbu>	 TheresNoTime, parsoid requests go to parse200* cluster btw. not sure if mwdebug* will let you test this.
[23:09:51] <subbu>	 s/cluster/servers 
[23:09:55] <TheresNoTime>	 ahhhh
[23:09:58] * TheresNoTime syncs
[23:10:21] <mutante>	 the parsoid canary servers are parse2001/2002 and parse1001/1002
[23:10:28] <icinga-wm>	 PROBLEM - Check systemd state on snapshot1008 is CRITICAL: CRITICAL - degraded: The following units failed: cirrussearch-dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:10:35] <subbu>	 mutante, ah .. good to know.
[23:12:30] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 116 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[23:12:33] <TheresNoTime>	 should be starting to see a drop off in exceptions now
[23:12:55] <jinxer-wm>	 (NodeTextfileStale) firing: (48) Stale textfile for cp1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[23:14:22] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is OK: (C)100 gt (W)50 gt 45 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[23:14:25] <subbu>	 TheresNoTime, looks like it has.
[23:14:36] <subbu>	 icinga was faster than me.
[23:14:46] <logmsgbot>	 !log samtar@deploy1002 Finished scap: Backport for [[gerrit:865749|Make parsoid accept all content models. (T324711)]] (duration: 13m 57s)
[23:14:48] <TheresNoTime>	 phew
[23:14:50] <stashbot>	 T324711: UnexpectedValueException: Parsoid does not support content model proofread-index - https://phabricator.wikimedia.org/T324711
[23:15:02] <wikibugs>	 (03PS4) 10Dzahn: phabricator: rm code from before system user was created with systemd [puppet] - 10https://gerrit.wikimedia.org/r/865208 (https://phabricator.wikimedia.org/T280597)
[23:15:26] <wikibugs>	 (03CR) 10Dzahn: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/output/865208/38639/" [puppet] - 10https://gerrit.wikimedia.org/r/865208 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn)
[23:15:26] <icinga-wm>	 PROBLEM - Disk space on mx1001 is CRITICAL: DISK CRITICAL - free space: / 720 MB (3% inode=91%): /tmp 720 MB (3% inode=91%): /var/tmp 720 MB (3% inode=91%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mx1001&var-datasource=eqiad+prometheus/ops
[23:16:24] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1201', diff saved to https://phabricator.wikimedia.org/P42544 and previous config saved to /var/cache/conftool/dbconfig/20221207-231623-ladsgroup.json
[23:23:20] <mutante>	 !log mx1001 - apt-get clean, gzip /var/log/exim4/mainlog.1  find -mtime +31 -delete in /var/log/exim4 - deleting old logs to prevent mail server running out of disk - it was alerting in Icinga but same as conf* - monitoring works, alerting does not
[23:23:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:24:22] <mutante>	 !log mx1001 about to run out of disk again -  apt-get clean, gzip /var/log/exim4/mainlog.1  find -mtime +31 -delete in /var/log/exim4 - deleting old logs to prevent mail server running out of disk - it was alerting in Icinga but same as conf* - monitoring works, alerting does not T305567
[23:24:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:24:26] <stashbot>	 T305567: MX: increasing disk space - https://phabricator.wikimedia.org/T305567
[23:24:54] <wikibugs>	 (CR) Jforrester: ci: move docker::settings to common, avoid host names (1 comment) [puppet] - https://gerrit.wikimedia.org/r/865735 (https://phabricator.wikimedia.org/T313832) (owner: Dzahn)
[23:25:59] <wikibugs>	 (CR) Dzahn: ci: move docker::settings to common, avoid host names (1 comment) [puppet] - https://gerrit.wikimedia.org/r/865735 (https://phabricator.wikimedia.org/T313832) (owner: Dzahn)
[23:26:36] <icinga-wm>	 PROBLEM - Router interfaces on cr2-drmrs is CRITICAL: CRITICAL: host 185.15.58.129, interfaces up: 60, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:26:38] <icinga-wm>	 PROBLEM - Router interfaces on cr3-esams is CRITICAL: CRITICAL: host 91.198.174.245, interfaces up: 83, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:27:18] <subbu>	 TheresNoTime, alright, I'm going to relocate and will be unavailable for a bit ... but it looks like all is well so far.
[23:27:47] <TheresNoTime>	 subbu: I'll be around and will keep an eye for a bit, but looking okay
[23:27:56] <subbu>	 perfect. thanks!
[23:31:04] <wikibugs>	 SRE, Infrastructure-Foundations, Mail: MX: increasing disk space - https://phabricator.wikimedia.org/T305567 (Dzahn) I think the priority is surprisingly low for this being the main prod mail server and almost running out of disk multiple times.
[23:31:30] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1201 (T322618)', diff saved to https://phabricator.wikimedia.org/P42545 and previous config saved to /var/cache/conftool/dbconfig/20221207-233130-ladsgroup.json
[23:31:32] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance
[23:31:34] <stashbot>	 T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618
[23:31:45] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance
[23:32:06] <icinga-wm>	 RECOVERY - Router interfaces on cr2-drmrs is OK: OK: host 185.15.58.129, interfaces up: 61, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:32:08] <icinga-wm>	 RECOVERY - Router interfaces on cr3-esams is OK: OK: host 91.198.174.245, interfaces up: 84, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:36:12] <icinga-wm>	 RECOVERY - Disk space on mx1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mx1001&var-datasource=eqiad+prometheus/ops
[23:38:25] <logmsgbot>	 !log cwhite@cumin2002 START - Cookbook sre.hosts.reimage for host logstash1012.eqiad.wmnet with OS bullseye
[23:43:56] <wikibugs>	 (PS2) RLazarus: Refactor: Wrap verify_certs inside an Options type. [software/httpbb] - https://gerrit.wikimedia.org/r/865790 (https://phabricator.wikimedia.org/T323707)
[23:43:58] <wikibugs>	 (PS2) RLazarus: Refactor: Migrate from attrs to dataclasses [software/httpbb] - https://gerrit.wikimedia.org/r/865789
[23:44:00] <wikibugs>	 (PS2) RLazarus: Add an option, off by default, to retry once when a request times out. [software/httpbb] - https://gerrit.wikimedia.org/r/865791 (https://phabricator.wikimedia.org/T323707)
[23:44:02] <wikibugs>	 (PS1) RLazarus: Typing cleanup, mostly associated with Python version upgrade [software/httpbb] - https://gerrit.wikimedia.org/r/865794
[23:45:38] <wikibugs>	 (CR) CI reject: [V: -1] Add an option, off by default, to retry once when a request times out. [software/httpbb] - https://gerrit.wikimedia.org/r/865791 (https://phabricator.wikimedia.org/T323707) (owner: RLazarus)
[23:45:45] <wikibugs>	 (CR) CI reject: [V: -1] Typing cleanup, mostly associated with Python version upgrade [software/httpbb] - https://gerrit.wikimedia.org/r/865794 (owner: RLazarus)
[23:45:52] <wikibugs>	 (CR) CI reject: [V: -1] Refactor: Wrap verify_certs inside an Options type. [software/httpbb] - https://gerrit.wikimedia.org/r/865790 (https://phabricator.wikimedia.org/T323707) (owner: RLazarus)
[23:53:31] <wikibugs>	 (PS3) RLazarus: Refactor: Migrate from attrs to dataclasses [software/httpbb] - https://gerrit.wikimedia.org/r/865789
[23:53:33] <wikibugs>	 (PS3) RLazarus: Refactor: Wrap verify_certs inside an Options type. [software/httpbb] - https://gerrit.wikimedia.org/r/865790 (https://phabricator.wikimedia.org/T323707)
[23:53:35] <wikibugs>	 (PS3) RLazarus: Add an option, off by default, to retry once when a request times out. [software/httpbb] - https://gerrit.wikimedia.org/r/865791 (https://phabricator.wikimedia.org/T323707)
[23:54:05] <wikibugs>	 (Abandoned) RLazarus: Typing cleanup, mostly associated with Python version upgrade [software/httpbb] - https://gerrit.wikimedia.org/r/865794 (owner: RLazarus)
[23:57:27] <wikibugs>	 (CR) jenkins-bot: Add an option, off by default, to retry once when a request times out. [software/httpbb] - https://gerrit.wikimedia.org/r/865791 (https://phabricator.wikimedia.org/T323707) (owner: RLazarus)
[23:59:50] <wikibugs>	 (PS4) RLazarus: Add an option, off by default, to retry once when a request times out. [software/httpbb] - https://gerrit.wikimedia.org/r/865791 (https://phabricator.wikimedia.org/T323707)