[00:01:58] <icinga-wm>	 PROBLEM - Maps tiles generation on alert1001 is CRITICAL: CRITICAL: 100.00% of data under the critical threshold [5.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/d/000000305/maps-performances?orgId=1&viewPanel=8
[00:20:50] <wikibugs>	 (PS1) Tim Starling: Add MariaDB grants for x2 [puppet] - https://gerrit.wikimedia.org/r/802669
[00:26:16] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[00:29:47] <icinga-wm>	 RECOVERY - SSH on wtp1038.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:31:38] <wikibugs>	 SRE, ops-eqiad: eqiad: Move links to new MPC7E linecard - https://phabricator.wikimedia.org/T304712 (Papaul) @ayounsi looks good to me
[00:34:01] <icinga-wm>	 RECOVERY - Disk space on centrallog2002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=centrallog2002&var-datasource=codfw+prometheus/ops
[00:36:35] <logmsgbot>	 !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host clouddumps1001.wikimedia.org with OS bullseye
[00:36:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:36:45] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations, and 3 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host clouddumps1001.wikimedia.org with...
[00:43:15] <wikibugs>	 (PS2) Tim Starling: Add MariaDB grants for x2 [puppet] - https://gerrit.wikimedia.org/r/802669
[00:52:41] <wikibugs>	 (PS3) Tim Starling: [WIP] Add MariaDB grants for x2 [puppet] - https://gerrit.wikimedia.org/r/802669
[01:00:53] <icinga-wm>	 PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:05:04] <wikibugs>	 (PS4) Tim Starling: [WIP] Add MariaDB grants for x2 [puppet] - https://gerrit.wikimedia.org/r/802669
[01:08:02] <wikibugs>	 (PS5) Tim Starling: Add MariaDB grants for x2 [puppet] - https://gerrit.wikimedia.org/r/802669
[01:09:53] <icinga-wm>	 PROBLEM - Check systemd state on gitlab1001 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-config-backup-gitlab1004.wikimedia.org.service,rsync-data-backup-gitlab1004.wikimedia.org.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:10:25] <wikibugs>	 (PS6) Tim Starling: Add MariaDB grants for x2 [puppet] - https://gerrit.wikimedia.org/r/802669 (https://phabricator.wikimedia.org/T212129)
[01:12:02] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: restart to enable S3 plugin - bking@cumin1001 - T309720
[01:12:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:12:08] <stashbot>	 T309720: Deploy S3 plugin on all Search team-managed Elastic hosts - https://phabricator.wikimedia.org/T309720
[01:14:11] <icinga-wm>	 PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[01:15:01] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:15:30] <wikibugs>	 (CR) Tim Starling: "I tested it with the puppet compiler, confirming that no change is made to servers that are not in x2. https://puppet-compiler.wmflabs.org" [puppet] - https://gerrit.wikimedia.org/r/802669 (https://phabricator.wikimedia.org/T212129) (owner: Tim Starling)
[01:20:34] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1121.eqiad.wmnet with reason: Maintenance
[01:20:36] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1121.eqiad.wmnet with reason: Maintenance
[01:20:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:20:37] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[01:20:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:20:40] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[01:20:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:20:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:20:45] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1121 (T298560)', diff saved to https://phabricator.wikimedia.org/P29365 and previous config saved to /var/cache/conftool/dbconfig/20220603-012045-ladsgroup.json
[01:20:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:20:49] <stashbot>	 T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560
[01:26:56] <wikibugs>	 (CR) MacFan4000: "This change is ready for review." [mediawiki-config] - https://gerrit.wikimedia.org/r/802612 (owner: MacFan4000)
[01:49:02] <wikibugs>	 (CR) Tim Starling: [C: +2] Add MariaDB grants for x2 [puppet] - https://gerrit.wikimedia.org/r/802669 (https://phabricator.wikimedia.org/T212129) (owner: Tim Starling)
[01:54:21] <TimStarling>	 !log on db1151 (x2), created mainstash database and applied suitable grants
[01:54:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:31:46] <wikibugs>	 SRE, MediaWiki-General, Performance-Team, serviceops-radar, and 5 others: Move MainStash out of Redis to a simpler multi-dc aware solution - https://phabricator.wikimedia.org/T212129 (tstarling) I created the database and table, applied the grants, and tested it from eval.php, testing the wikiadm...
[02:32:43] <icinga-wm>	 PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:42:23] <wikibugs>	 (PS1) BryanDavis: proxy: horrible hack for T309821 [software/tools-webservice] - https://gerrit.wikimedia.org/r/802683 (https://phabricator.wikimedia.org/T309821)
[02:45:12] <wikibugs>	 (PS2) BryanDavis: proxy: horrible hack for T309821 [software/tools-webservice] - https://gerrit.wikimedia.org/r/802683 (https://phabricator.wikimedia.org/T309821)
[02:47:37] <wikibugs>	 (CR) BryanDavis: "This is disgusting mostly because I have no idea what changed that is making the response be empty now." [software/tools-webservice] - https://gerrit.wikimedia.org/r/802683 (https://phabricator.wikimedia.org/T309821) (owner: BryanDavis)
[02:53:44] <wikibugs>	 (PS1) BryanDavis: d/changelog: Prepare for 0.85 release [software/tools-webservice] - https://gerrit.wikimedia.org/r/802684 (https://phabricator.wikimedia.org/T309821)
[02:54:09] <wikibugs>	 (CR) BryanDavis: [C: +2] proxy: horrible hack for T309821 [software/tools-webservice] - https://gerrit.wikimedia.org/r/802683 (https://phabricator.wikimedia.org/T309821) (owner: BryanDavis)
[02:55:07] <wikibugs>	 (CR) CI reject: [V: -1] d/changelog: Prepare for 0.85 release [software/tools-webservice] - https://gerrit.wikimedia.org/r/802684 (https://phabricator.wikimedia.org/T309821) (owner: BryanDavis)
[02:55:10] <wikibugs>	 (Merged) jenkins-bot: proxy: horrible hack for T309821 [software/tools-webservice] - https://gerrit.wikimedia.org/r/802683 (https://phabricator.wikimedia.org/T309821) (owner: BryanDavis)
[03:00:00] <wikibugs>	 (PS2) BryanDavis: d/changelog: Prepare for 0.85 release [software/tools-webservice] - https://gerrit.wikimedia.org/r/802684 (https://phabricator.wikimedia.org/T309821)
[03:01:38] <wikibugs>	 (CR) BryanDavis: [C: +2] d/changelog: Prepare for 0.85 release [software/tools-webservice] - https://gerrit.wikimedia.org/r/802684 (https://phabricator.wikimedia.org/T309821) (owner: BryanDavis)
[03:02:33] <wikibugs>	 (PS1) Samwilson: Update $wgVectorMaxWidthOptions to include action=edit [mediawiki-config] - https://gerrit.wikimedia.org/r/802685 (https://phabricator.wikimedia.org/T307725)
[03:03:19] <icinga-wm>	 RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:03:22] <wikibugs>	 (Merged) jenkins-bot: d/changelog: Prepare for 0.85 release [software/tools-webservice] - https://gerrit.wikimedia.org/r/802684 (https://phabricator.wikimedia.org/T309821) (owner: BryanDavis)
[03:06:55] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:09:03] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.315 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:33:51] <icinga-wm>	 RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:57:43] <icinga-wm>	 PROBLEM - SSH on druid1006.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:24:13] <icinga-wm>	 RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[04:26:16] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[04:33:39] <icinga-wm>	 PROBLEM - Backup freshness on backup1001 is CRITICAL: All failures: 1 (gitlab1004), Fresh: 115 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[04:56:32] <wikibugs>	 SRE, MediaWiki-General, Performance-Team, serviceops-radar, and 5 others: Move MainStash out of Redis to a simpler multi-dc aware solution - https://phabricator.wikimedia.org/T212129 (Marostegui) I'm fully alone next week as the rest of the team is gone, can we enable this the following week inst...
[05:19:01] <marostegui>	 !log Stop mysql on db1128 for on-site maintenance T309291
[05:19:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:19:09] <stashbot>	 T309291: db1128 faulty memory - https://phabricator.wikimedia.org/T309291
[05:20:38] <wikibugs>	 SRE, ops-eqiad: db1128 faulty memory - https://phabricator.wikimedia.org/T309291 (Marostegui) @Cmjohnson db1128 is now off and ready for you to change its DIMM anytime. Once done please bring it back and I will start mysql etc.  Thanks!
[05:21:57] <wikibugs>	 (PS1) Marostegui: db1128: Add it to dbctl [puppet] - https://gerrit.wikimedia.org/r/802689 (https://phabricator.wikimedia.org/T309303)
[05:23:06] <wikibugs>	 SRE, Infrastructure-Foundations, Parsoid: Retire the old Parsoid deb repository? - https://phabricator.wikimedia.org/T309765 (MoritzMuehlenhoff) Ack, sounds good. I'll wait for a week and if there are no objections, the repo will be removed from the release* hosts.
[05:23:48] <wikibugs>	 SRE, LDAP-Access-Requests: Add Evelien WMDE to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T309700 (MoritzMuehlenhoff) Thanks, we'll proceed as soon as the NDA is wrapped up.
[05:25:42] <wikibugs>	 (CR) Marostegui: [C: +2] db1128: Add it to dbctl [puppet] - https://gerrit.wikimedia.org/r/802689 (https://phabricator.wikimedia.org/T309303) (owner: Marostegui)
[05:28:25] <icinga-wm>	 PROBLEM - Check systemd state on ms-be1040 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:34:55] <icinga-wm>	 RECOVERY - Backup freshness on backup1001 is OK: Fresh: 116 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[05:53:31] <icinga-wm>	 RECOVERY - Check systemd state on ms-be1040 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:30:24] <wikibugs>	 SRE, MediaWiki-General, Performance-Team, serviceops-radar, and 5 others: Move MainStash out of Redis to a simpler multi-dc aware solution - https://phabricator.wikimedia.org/T212129 (tstarling) OK, how about June 14, 05:00 UTC?
[06:42:45] <wikibugs>	 (CR) Joal: [C: -1] "Missing tables in the first bit of the split - Otherwise good! thanks Dan" [puppet] - https://gerrit.wikimedia.org/r/802598 (https://phabricator.wikimedia.org/T309806) (owner: Milimetric)
[06:47:53] <wikibugs>	 (CR) Muehlenhoff: [C: +2] "Thanks! Merging" [puppet] - https://gerrit.wikimedia.org/r/802567 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[06:47:59] <wikibugs>	 (PS2) Muehlenhoff: puppet_stastd: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/802567 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[06:53:37] <wikibugs>	 (CR) Muehlenhoff: [C: +2] netops: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/802572 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[06:53:45] <wikibugs>	 (CR) Muehlenhoff: [C: +2] "Thanks! Merging" [puppet] - https://gerrit.wikimedia.org/r/802572 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[06:53:52] <wikibugs>	 (PS2) Muehlenhoff: netops: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/802572 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[06:57:29] <wikibugs>	 (PS2) Muehlenhoff: query_service: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/802566 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[07:00:05] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220603T0700)
[07:00:30] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good, adding Search people for a second pass." [puppet] - https://gerrit.wikimedia.org/r/802566 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[07:03:13] <wikibugs>	 (PS2) Muehlenhoff: poolcounter: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/802569 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[07:07:38] <wikibugs>	 (CR) Muehlenhoff: [C: +2] "Thanks! Merging" [puppet] - https://gerrit.wikimedia.org/r/802569 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[07:08:27] <icinga-wm>	 PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:11:43] <wikibugs>	 (CR) Muehlenhoff: [C: +2] "Thanks! Merging" [puppet] - https://gerrit.wikimedia.org/r/802571 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[07:11:49] <wikibugs>	 (PS2) Muehlenhoff: nftables: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/802571 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[07:16:46] <jayme>	 !log imported scap 4.8.2 to stretch-/buster-/bullseye-wikimedia - T309116
[07:16:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:16:51] <stashbot>	 T309116: Deploy Scap version 4.8.2 - https://phabricator.wikimedia.org/T309116
[07:20:57] <logmsgbot>	 !log jayme@deploy1002 Started deploy [restbase/deploy@6e39559] (dev-cluster): (no justification provided)
[07:20:57] <wikibugs>	 (CR) Muehlenhoff: [C: +2] "Thanks! Merging" [puppet] - https://gerrit.wikimedia.org/r/802568 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[07:20:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:21:05] <wikibugs>	 (PS2) Muehlenhoff: presto: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/802568 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[07:28:48] <wikibugs>	 (CR) Muehlenhoff: [C: +2] "Thanks! Merging" [puppet] - https://gerrit.wikimedia.org/r/802570 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[07:28:54] <wikibugs>	 (PS2) Muehlenhoff: pontoon: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/802570 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[07:33:01] <wikibugs>	 SRE, Release-Engineering-Team, Scap, serviceops: Deploy Scap version 4.8.2 - https://phabricator.wikimedia.org/T309116 (JMeybohm) Deployed to canaries, scap pull looks fine
[07:33:35] <logmsgbot>	 !log jayme@deploy1002 Finished deploy [restbase/deploy@6e39559] (dev-cluster): (no justification provided) (duration: 12m 38s)
[07:33:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:37:25] <wikibugs>	 (CR) Muehlenhoff: [C: +2] "Thanks! Merging" [puppet] - https://gerrit.wikimedia.org/r/802565 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[07:37:32] <wikibugs>	 (PS2) Muehlenhoff: rabbitmq: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/802565 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[07:39:07] <wikibugs>	 (PS2) Slyngshede: kubeadm: drop support for 1.20 [puppet] - https://gerrit.wikimedia.org/r/802143 (owner: Majavah)
[07:39:09] <wikibugs>	 (PS2) Slyngshede: aptrepo: add thirdparty/kubeadm-k8s-1-22 [puppet] - https://gerrit.wikimedia.org/r/802146 (https://phabricator.wikimedia.org/T286856) (owner: Majavah)
[07:39:38] <wikibugs>	 (CR) Slyngshede: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35714/console" [puppet] - https://gerrit.wikimedia.org/r/792109 (https://phabricator.wikimedia.org/T273673) (owner: Slyngshede)
[07:42:13] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:42:25] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:44:19] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.296 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:44:33] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48248 bytes in 0.100 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:52:34] <wikibugs>	 (CR) Muehlenhoff: raid: Add SPDX headers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/802564 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[07:53:24] <wikibugs>	 (CR) Muehlenhoff: raid: Add SPDX headers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/802564 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[07:55:35] <icinga-wm>	 PROBLEM - SSH on wtp1039.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:03:45] <icinga-wm>	 PROBLEM - SSH on cp5012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:09:25] <wikibugs>	 (PS1) Muehlenhoff: Remove Puppet references to idp-test1001/idp-test2001 [puppet] - https://gerrit.wikimedia.org/r/802729 (https://phabricator.wikimedia.org/T308214)
[08:09:35] <icinga-wm>	 RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:10:06] <wikibugs>	 (CR) DCausse: [C: +1] query_service: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/802566 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[08:15:16] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Remove Puppet references to idp-test1001/idp-test2001 [puppet] - https://gerrit.wikimedia.org/r/802729 (https://phabricator.wikimedia.org/T308214) (owner: Muehlenhoff)
[08:17:56] <wikibugs>	 (PS3) Muehlenhoff: query_service: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/802566 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[08:20:37] <wikibugs>	 (CR) Muehlenhoff: [C: +2] query_service: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/802566 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[08:26:16] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[08:37:22] <wikibugs>	 (PS3) Slyngshede: kubeadm: drop support for 1.20 [puppet] - https://gerrit.wikimedia.org/r/802143 (owner: Majavah)
[08:37:24] <wikibugs>	 (PS3) Slyngshede: aptrepo: add thirdparty/kubeadm-k8s-1-22 [puppet] - https://gerrit.wikimedia.org/r/802146 (https://phabricator.wikimedia.org/T286856) (owner: Majavah)
[08:37:38] <wikibugs>	 (CR) Slyngshede: [V: +2] aptrepo: add thirdparty/kubeadm-k8s-1-22 [puppet] - https://gerrit.wikimedia.org/r/802146 (https://phabricator.wikimedia.org/T286856) (owner: Majavah)
[08:45:39] <wikibugs>	 (CR) Slyngshede: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/802531 (https://phabricator.wikimedia.org/T309786) (owner: David Caro)
[08:49:35] <wikibugs>	 SRE, Gerrit: Icinga Check SSL might have a time based race condition - https://phabricator.wikimedia.org/T308908 (hashar) Open→Resolved a:dancy @dancy solved it by restarting Apache2. The root cause is somewhere in Apache 2 and is tracked by T293826
[08:52:50] <wikibugs>	 SRE, Traffic, observability, Upstream: flapping icinga Letsencrypt TLS cert alerts around renewal time - https://phabricator.wikimedia.org/T293826 (hashar)
[08:56:56] <logmsgbot>	 !log cmooney@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[08:56:58] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[08:56:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:57:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:58:13] <logmsgbot>	 !log jnuche@deploy1002 install-world aborted:  (duration: 00m 03s)
[08:58:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:58:47] <wikibugs>	 (PS8) Vgutierrez: [WIP] esitest service [puppet] - https://gerrit.wikimedia.org/r/793561 (https://phabricator.wikimedia.org/T308799) (owner: BBlack)
[09:00:02] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:00:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:06:00] <wikibugs>	 (PS9) Vgutierrez: [WIP] esitest service [puppet] - https://gerrit.wikimedia.org/r/793561 (https://phabricator.wikimedia.org/T308799) (owner: BBlack)
[09:11:13] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts idp-test2001.wikimedia.org
[09:11:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:14:12] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): httpbb: Add basic tests for query_service (WDQS, WCQS) (1 comment) [puppet] - https://gerrit.wikimedia.org/r/802079 (owner: Lucas Werkmeister (WMDE))
[09:14:22] <wikibugs>	 (PS3) Lucas Werkmeister (WMDE): httpbb: Add basic tests for query_service (WDQS) [puppet] - https://gerrit.wikimedia.org/r/802079
[09:14:24] <wikibugs>	 (PS4) Lucas Werkmeister (WMDE): query_service: don’t cache index files [puppet] - https://gerrit.wikimedia.org/r/799297 (https://phabricator.wikimedia.org/T289243)
[09:15:04] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[09:15:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:20:21] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:20:22] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts idp-test2001.wikimedia.org
[09:20:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:20:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:20:29] <wikibugs>	 SRE, Infrastructure-Foundations, Patch-For-Review: Migrate the IDPs to Bullseye - https://phabricator.wikimedia.org/T308214 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `idp-test2001.wikimedia.org` - idp-test2001.wikimedia.org (**PASS**)   - Downtimed hos...
[09:21:07] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts idp-test1001.wikimedia.org
[09:21:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:24:25] <wikibugs>	 (PS16) Jbond: prometheus::blackbox::check: add new blackbox exporter check [puppet] - https://gerrit.wikimedia.org/r/787067
[09:24:59] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[09:25:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:25:48] <wikibugs>	 (CR) WMDE-Fisch: "This change is ready for review." [mediawiki-config] - https://gerrit.wikimedia.org/r/802732 (https://phabricator.wikimedia.org/T307348) (owner: WMDE-Fisch)
[09:28:26] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:28:26] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts idp-test1001.wikimedia.org
[09:28:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:28:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:28:38] <wikibugs>	 SRE, Infrastructure-Foundations, Patch-For-Review: Migrate the IDPs to Bullseye - https://phabricator.wikimedia.org/T308214 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `idp-test1001.wikimedia.org` - idp-test1001.wikimedia.org (**PASS**)   - Downtimed hos...
[09:32:26] <wikibugs>	 (CR) Ayounsi: Add role::netmon to the netmon1003 instance. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/802593 (https://phabricator.wikimedia.org/T309074) (owner: Andrea Denisse)
[09:39:21] <wikibugs>	 (PS17) Jbond: prometheus::blackbox::check: add new blackbox exporter check [puppet] - https://gerrit.wikimedia.org/r/787067
[09:41:00] <wikibugs>	 (CR) Vgutierrez: [WIP] esitest service (1 comment) [puppet] - https://gerrit.wikimedia.org/r/793561 (https://phabricator.wikimedia.org/T308799) (owner: BBlack)
[09:42:59] <wikibugs>	 (PS11) David Caro: wmcs: Added taskircmail, ircmail and pagetaskircmail routings [puppet] - https://gerrit.wikimedia.org/r/802040
[09:43:01] <wikibugs>	 (CR) David Caro: wmcs: Added taskircmail, ircmail and pagetaskircmail routings (1 comment) [puppet] - https://gerrit.wikimedia.org/r/802040 (owner: David Caro)
[09:43:06] <wikibugs>	 (CR) David Caro: wmcs: Added taskircmail, ircmail and pagetaskircmail routings (1 comment) [puppet] - https://gerrit.wikimedia.org/r/802040 (owner: David Caro)
[09:43:20] <wikibugs>	 (CR) Ayounsi: "I haven't look at the code but I think you tested it with the ulsfo cluster on netbox-next." [software/netbox-extras] - https://gerrit.wikimedia.org/r/802179 (https://phabricator.wikimedia.org/T262446) (owner: Volans)
[09:45:24] <wikibugs>	 (CR) David Caro: [C: +2] ceph: filter out also dbgsym packages [puppet] - https://gerrit.wikimedia.org/r/802531 (https://phabricator.wikimedia.org/T309786) (owner: David Caro)
[09:53:42] <wikibugs>	 (PS1) David Caro: ceph: fix regex to match dbg/dbgsym [puppet] - https://gerrit.wikimedia.org/r/802736 (https://phabricator.wikimedia.org/T309786)
[09:54:36] <wikibugs>	 (CR) Jbond: prometheus::blackbox::check: add new blackbox exporter check (1 comment) [puppet] - https://gerrit.wikimedia.org/r/787067 (owner: Jbond)
[09:55:19] <wikibugs>	 (PS18) Jbond: prometheus::blackbox::check: add new blackbox exporter check [puppet] - https://gerrit.wikimedia.org/r/787067
[09:56:42] <icinga-wm>	 PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[10:00:06] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35715/console" [puppet] - https://gerrit.wikimedia.org/r/787067 (owner: Jbond)
[10:01:45] <wikibugs>	 SRE, MediaWiki-General, Performance-Team, serviceops-radar, and 5 others: Move MainStash out of Redis to a simpler multi-dc aware solution - https://phabricator.wikimedia.org/T212129 (Marostegui) >>! In T212129#7978237, @tstarling wrote: > OK, how about June 14, 05:00 UTC?  That would work. I wil...
[10:03:55] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to wmf for Lgaulia - https://phabricator.wikimedia.org/T309844 (larissagaulia)
[10:05:22] <icinga-wm>	 RECOVERY - SSH on cp5012.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:05:56] <wikibugs>	 (PS3) Lucas Werkmeister (WMDE): Refresh English Wikipedia logo file (enwiki.png) [mediawiki-config] - https://gerrit.wikimedia.org/r/801405 (https://phabricator.wikimedia.org/T309544)
[10:06:37] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): "Scheduled for deployment on Monday." [mediawiki-config] - https://gerrit.wikimedia.org/r/801405 (https://phabricator.wikimedia.org/T309544) (owner: Lucas Werkmeister (WMDE))
[10:08:40] <wikibugs>	 (PS1) Jbond: P:sretest: DO NOT MERGE - test prometheus::blackbox::check::http define [puppet] - https://gerrit.wikimedia.org/r/802737
[10:08:54] <wikibugs>	 (PS10) Vgutierrez: [WIP] esitest service [puppet] - https://gerrit.wikimedia.org/r/793561 (https://phabricator.wikimedia.org/T308799) (owner: BBlack)
[10:09:34] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35716/console" [puppet] - https://gerrit.wikimedia.org/r/802737 (owner: Jbond)
[10:13:09] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): Refresh English Wikipedia logo file (enwiki.png) (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/801405 (https://phabricator.wikimedia.org/T309544) (owner: Lucas Werkmeister (WMDE))
[10:17:40] <wikibugs>	 (CR) Ayounsi: "I think it would be better for consistency to use the 10/8 prod realm loopback. We could potentially add separate entries for the VRF loop" [puppet] - https://gerrit.wikimedia.org/r/802499 (https://phabricator.wikimedia.org/T304989) (owner: Cathal Mooney)
[10:21:54] <wikibugs>	 (PS11) Vgutierrez: [WIP] esitest service [puppet] - https://gerrit.wikimedia.org/r/793561 (https://phabricator.wikimedia.org/T308799) (owner: BBlack)
[10:38:41] <wikibugs>	 (PS1) Muehlenhoff: remove stray webperf entry in site.pp [puppet] - https://gerrit.wikimedia.org/r/802746
[10:41:01] <wikibugs>	 (CR) Muehlenhoff: [C: +2] remove stray webperf entry in site.pp [puppet] - https://gerrit.wikimedia.org/r/802746 (owner: Muehlenhoff)
[10:53:50] <wikibugs>	 (PS1) Muehlenhoff: Enable webperf1004/2004 as new Arclamp hosts [puppet] - https://gerrit.wikimedia.org/r/802749
[10:53:52] <wikibugs>	 (PS1) Muehlenhoff: Point active arclamp host to webperf1004 and update dsh groups [puppet] - https://gerrit.wikimedia.org/r/802750 (https://phabricator.wikimedia.org/T305460)
[10:55:16] <icinga-wm>	 PROBLEM - Check systemd state on sretest1001 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_ulogd2.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:57:05] <moritzm>	 ^ that's me fixing up sretest1001
[10:58:18] <icinga-wm>	 RECOVERY - SSH on wtp1039.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:59:54] <icinga-wm>	 RECOVERY - Check systemd state on sretest1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:10:52] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:10:52] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:12:24] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2052 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:13:02] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48248 bytes in 0.074 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:13:02] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.322 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:13:02] <icinga-wm>	 PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[11:14:34] <wikibugs>	 (PS1) Muehlenhoff: arclamp: add rsync config to migrate Xenon data [puppet] - https://gerrit.wikimedia.org/r/802752 (https://phabricator.wikimedia.org/T305460)
[11:15:47] <wikibugs>	 (CR) CI reject: [V: -1] arclamp: add rsync config to migrate Xenon data [puppet] - https://gerrit.wikimedia.org/r/802752 (https://phabricator.wikimedia.org/T305460) (owner: Muehlenhoff)
[11:18:25] <wikibugs>	 (PS2) Muehlenhoff: arclamp: add rsync config to migrate Xenon data [puppet] - https://gerrit.wikimedia.org/r/802752 (https://phabricator.wikimedia.org/T305460)
[11:19:19] <wikibugs>	 (CR) CI reject: [V: -1] arclamp: add rsync config to migrate Xenon data [puppet] - https://gerrit.wikimedia.org/r/802752 (https://phabricator.wikimedia.org/T305460) (owner: Muehlenhoff)
[11:22:35] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121 (T298560)', diff saved to https://phabricator.wikimedia.org/P29366 and previous config saved to /var/cache/conftool/dbconfig/20220603-112234-ladsgroup.json
[11:22:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:22:40] <stashbot>	 T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560
[11:27:48] <wikibugs>	 (PS3) Muehlenhoff: arclamp: add rsync config to migrate Xenon data [puppet] - https://gerrit.wikimedia.org/r/802752 (https://phabricator.wikimedia.org/T305460)
[11:36:21] <wikibugs>	 SRE, Performance-Team, Patch-For-Review: Upgrade webperf hosts to Bullseye - https://phabricator.wikimedia.org/T305460 (MoritzMuehlenhoff)
[11:37:34] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2052 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:37:40] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121', diff saved to https://phabricator.wikimedia.org/P29367 and previous config saved to /var/cache/conftool/dbconfig/20220603-113739-ladsgroup.json
[11:37:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:48:08] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +2] mediawiki 0.2.1: Add a helm test [deployment-charts] - https://gerrit.wikimedia.org/r/800118 (owner: Ahmon Dancy)
[11:49:10] <wikibugs>	 (PS1) Muehlenhoff: ipmi: Assign SPDX headers [puppet] - https://gerrit.wikimedia.org/r/802757 (https://phabricator.wikimedia.org/T308013)
[11:49:12] <wikibugs>	 (PS1) Muehlenhoff: webperf: Assign SPDX headers [puppet] - https://gerrit.wikimedia.org/r/802758 (https://phabricator.wikimedia.org/T308013)
[11:51:24] <wikibugs>	 (Merged) jenkins-bot: mediawiki 0.2.1: Add a helm test [deployment-charts] - https://gerrit.wikimedia.org/r/800118 (owner: Ahmon Dancy)
[11:52:45] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121', diff saved to https://phabricator.wikimedia.org/P29368 and previous config saved to /var/cache/conftool/dbconfig/20220603-115244-ladsgroup.json
[11:52:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:54:15] <wikibugs>	 (PS5) KartikMistry: Update cxserver to  2022-05-31-123738-production [deployment-charts] - https://gerrit.wikimedia.org/r/801663 (https://phabricator.wikimedia.org/T306963)
[11:55:37] <wikibugs>	 (CR) KartikMistry: Update cxserver to  2022-05-31-123738-production (3 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/801663 (https://phabricator.wikimedia.org/T306963) (owner: KartikMistry)
[11:57:10] <wikibugs>	 (CR) CI reject: [V: -1] Update cxserver to  2022-05-31-123738-production [deployment-charts] - https://gerrit.wikimedia.org/r/801663 (https://phabricator.wikimedia.org/T306963) (owner: KartikMistry)
[12:01:55] <wikibugs>	 (PS6) KartikMistry: Update cxserver to  2022-05-31-123738-production [deployment-charts] - https://gerrit.wikimedia.org/r/801663 (https://phabricator.wikimedia.org/T306963)
[12:05:22] <wikibugs>	 (PS1) Muehlenhoff: Add lgaulia to LDAP users [puppet] - https://gerrit.wikimedia.org/r/802761 (https://phabricator.wikimedia.org/T309844)
[12:06:38] <icinga-wm>	 PROBLEM - Check systemd state on ms-be1064 is CRITICAL: CRITICAL - degraded: The following units failed: swift-drive-audit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:06:58] <icinga-wm>	 RECOVERY - SSH on druid1006.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:07:50] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121 (T298560)', diff saved to https://phabricator.wikimedia.org/P29369 and previous config saved to /var/cache/conftool/dbconfig/20220603-120750-ladsgroup.json
[12:07:52] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1141.eqiad.wmnet with reason: Maintenance
[12:07:53] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1141.eqiad.wmnet with reason: Maintenance
[12:07:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:07:55] <stashbot>	 T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560
[12:07:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:07:58] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1141 (T298560)', diff saved to https://phabricator.wikimedia.org/P29370 and previous config saved to /var/cache/conftool/dbconfig/20220603-120758-ladsgroup.json
[12:07:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:08:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:11:28] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Add lgaulia to LDAP users [puppet] - https://gerrit.wikimedia.org/r/802761 (https://phabricator.wikimedia.org/T309844) (owner: Muehlenhoff)
[12:13:13] <wikibugs>	 SRE, LDAP-Access-Requests, Patch-For-Review: Grant Access to wmf for Lgaulia - https://phabricator.wikimedia.org/T309844 (MoritzMuehlenhoff) Open→Resolved p:Triage→Medium a:MoritzMuehlenhoff @larissagaulia Your LDAP access to the wmf group has been enabled. Please reopen the task...
[12:25:34] <wikibugs>	 (CR) Alexandros Kosiaris: [C: -1] mwdebug service: Add traindev environment support (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/798883 (owner: Ahmon Dancy)
[12:26:16] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[12:26:47] <wikibugs>	 (CR) Cathal Mooney: "Thanks for the updates, I'll revise and update." [puppet] - https://gerrit.wikimedia.org/r/802499 (https://phabricator.wikimedia.org/T304989) (owner: Cathal Mooney)
[12:32:12] <wikibugs>	 (PS2) Cathal Mooney: Add cloudsw1-e4 and cloudsw1-f4 to mgmt and adjust existing cloudsw [puppet] - https://gerrit.wikimedia.org/r/802499 (https://phabricator.wikimedia.org/T304989)
[12:55:17] <wikibugs>	 (PS1) BryanDavis: Revert "proxy: horrible hack for T309821" [software/tools-webservice] - https://gerrit.wikimedia.org/r/802620 (https://phabricator.wikimedia.org/T309821)
[12:56:06] <wikibugs>	 (PS1) Itamar Givon: Turn Wikbase termbox SSR off for beta wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/802770 (https://phabricator.wikimedia.org/T304328)
[12:56:56] <wikibugs>	 (CR) Majavah: [C: +1] Revert "proxy: horrible hack for T309821" [software/tools-webservice] - https://gerrit.wikimedia.org/r/802620 (https://phabricator.wikimedia.org/T309821) (owner: BryanDavis)
[12:57:32] <wikibugs>	 (CR) BryanDavis: [C: +2] Revert "proxy: horrible hack for T309821" [software/tools-webservice] - https://gerrit.wikimedia.org/r/802620 (https://phabricator.wikimedia.org/T309821) (owner: BryanDavis)
[12:57:59] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +1] Update cxserver to  2022-05-31-123738-production (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/801663 (https://phabricator.wikimedia.org/T306963) (owner: KartikMistry)
[12:58:36] <wikibugs>	 (Merged) jenkins-bot: Revert "proxy: horrible hack for T309821" [software/tools-webservice] - https://gerrit.wikimedia.org/r/802620 (https://phabricator.wikimedia.org/T309821) (owner: BryanDavis)
[13:03:00] <icinga-wm>	 RECOVERY - Check systemd state on ms-be1064 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:05:55] <wikibugs>	 (PS1) Jforrester: extdist: Drop 1.36, now EOL [mediawiki-config] - https://gerrit.wikimedia.org/r/802772 (https://phabricator.wikimedia.org/T309864)
[13:09:01] <wikibugs>	 (PS1) BryanDavis: d/changelog: Prepare for 0.86 release [software/tools-webservice] - https://gerrit.wikimedia.org/r/802773 (https://phabricator.wikimedia.org/T309821)
[13:09:41] <wikibugs>	 (PS1) Jaime Nuche: scap: boostrap freshly provisioned scap targets [puppet] - https://gerrit.wikimedia.org/r/802775 (https://phabricator.wikimedia.org/T309713)
[13:10:21] <wikibugs>	 (CR) Jaime Nuche: [C: -1] "Batch operation has not been merged yet" [puppet] - https://gerrit.wikimedia.org/r/802775 (https://phabricator.wikimedia.org/T309713) (owner: Jaime Nuche)
[13:10:43] <wikibugs>	 (CR) CI reject: [V: -1] scap: boostrap freshly provisioned scap targets [puppet] - https://gerrit.wikimedia.org/r/802775 (https://phabricator.wikimedia.org/T309713) (owner: Jaime Nuche)
[13:11:41] <wikibugs>	 (CR) BryanDavis: [C: +2] d/changelog: Prepare for 0.86 release [software/tools-webservice] - https://gerrit.wikimedia.org/r/802773 (https://phabricator.wikimedia.org/T309821) (owner: BryanDavis)
[13:12:52] <wikibugs>	 (Merged) jenkins-bot: d/changelog: Prepare for 0.86 release [software/tools-webservice] - https://gerrit.wikimedia.org/r/802773 (https://phabricator.wikimedia.org/T309821) (owner: BryanDavis)
[13:17:38] <wikibugs>	 (PS2) Jaime Nuche: scap: boostrap freshly provisioned scap targets [puppet] - https://gerrit.wikimedia.org/r/802775 (https://phabricator.wikimedia.org/T309713)
[13:18:11] <wikibugs>	 (CR) Jaime Nuche: [C: -1] "Batch operation has not been merged yet" [puppet] - https://gerrit.wikimedia.org/r/802775 (https://phabricator.wikimedia.org/T309713) (owner: Jaime Nuche)
[13:22:19] <wikibugs>	 SRE, ops-eqiad: db1128 faulty memory - https://phabricator.wikimedia.org/T309291 (Cmjohnson) Open→Resolved replaced the DIMM and updated BIOS
[13:24:47] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): Turn Wikbase termbox SSR off for beta wikidata (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/802770 (https://phabricator.wikimedia.org/T304328) (owner: Itamar Givon)
[13:25:35] <wikibugs>	 (CR) Jakob: [C: -1] "Ugh, I think there is some confusing naming going on here and this doesn't quite do what we want. https://gerrit.wikimedia.org/g/operation" [mediawiki-config] - https://gerrit.wikimedia.org/r/802770 (https://phabricator.wikimedia.org/T304328) (owner: Itamar Givon)
[13:28:57] <wikibugs>	 (PS7) KartikMistry: Update cxserver to  2022-05-31-123738-production [deployment-charts] - https://gerrit.wikimedia.org/r/801663 (https://phabricator.wikimedia.org/T306963)
[13:29:19] <wikibugs>	 (CR) Itamar Givon: Turn Wikbase termbox SSR off for beta wikidata (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/802770 (https://phabricator.wikimedia.org/T304328) (owner: Itamar Givon)
[13:30:03] <wikibugs>	 (CR) KartikMistry: Update cxserver to  2022-05-31-123738-production (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/801663 (https://phabricator.wikimedia.org/T306963) (owner: KartikMistry)
[13:40:39] <wikibugs>	 (PS3) Jaime Nuche: scap: boostrap freshly provisioned scap targets [puppet] - https://gerrit.wikimedia.org/r/802775 (https://phabricator.wikimedia.org/T309713)
[13:42:44] <wikibugs>	 (PS1) Ladsgroup: switchover-tmpl: Add commands for the heartbeat and zarcillo [software] - https://gerrit.wikimedia.org/r/802778
[13:43:38] <wikibugs>	 (CR) Jakob: [C: -1] Turn Wikbase termbox SSR off for beta wikidata (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/802770 (https://phabricator.wikimedia.org/T304328) (owner: Itamar Givon)
[13:44:22] <Zppix>	 jouncebot: nowandnext
[13:44:23] <jouncebot>	 For the next 17 hour(s) and 15 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220603T0700)
[13:44:23] <jouncebot>	 In 17 hour(s) and 15 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220604T0700)
[13:45:14] <wikibugs>	 (CR) Ladsgroup: switchover-tmpl: Add commands for the heartbeat and zarcillo (1 comment) [software] - https://gerrit.wikimedia.org/r/802778 (owner: Ladsgroup)
[13:45:52] <icinga-wm>	 PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test
[13:51:28] <wikibugs>	 (CR) Jaime Nuche: [C: -1] scap: boostrap freshly provisioned scap targets [puppet] - https://gerrit.wikimedia.org/r/802775 (https://phabricator.wikimedia.org/T309713) (owner: Jaime Nuche)
[13:54:44] <icinga-wm>	 PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test
[13:55:36] <wikibugs>	 (PS2) Eevans: WIP: Configure AQS Cassandra hosts [puppet] - https://gerrit.wikimedia.org/r/802604 (https://phabricator.wikimedia.org/T307801)
[13:56:14] <wikibugs>	 (PS3) Eevans: WIP: Configure AQS Cassandra hosts [puppet] - https://gerrit.wikimedia.org/r/802604 (https://phabricator.wikimedia.org/T307801)
[13:57:51] <wikibugs>	 (CR) Jakob: [C: -1] Turn Wikbase termbox SSR off for beta wikidata (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/802770 (https://phabricator.wikimedia.org/T304328) (owner: Itamar Givon)
[14:03:30] <icinga-wm>	 PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test
[14:08:06] <icinga-wm>	 RECOVERY - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is OK: 2 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test
[14:10:30] <icinga-wm>	 PROBLEM - SSH on cp5012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[14:14:42] <inflatador>	 !log patching and restarting a few eqiad elastic hosts T309868
[14:14:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:14:47] <stashbot>	 T309868: Test openjdk 8 package upgrades on eqiad stretch hosts - https://phabricator.wikimedia.org/T309868
[14:25:37] <logmsgbot>	 !log herron@cumin1001 START - Cookbook sre.kafka.roll-restart-brokers for Kafka A:kafka-main-eqiad cluster: Roll restart of jvm daemons.
[14:25:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:47:26] <wikibugs>	 (CR) Mabualruz: Remove 6 deprecated ResourceLoader skin modules in core (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/802578 (https://phabricator.wikimedia.org/T304322) (owner: Mabualruz)
[14:48:50] <icinga-wm>	 PROBLEM - SSH on wtp1048.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[14:55:30] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): "Sorry, I didn’t check what this variable actually controls earlier." [mediawiki-config] - https://gerrit.wikimedia.org/r/802770 (https://phabricator.wikimedia.org/T304328) (owner: Itamar Givon)
[14:56:36] <wikibugs>	 (PS1) Jcrespo: MySQLMedia: Add unit testing and small refactorings [software/mediabackups] - https://gerrit.wikimedia.org/r/802787
[14:58:20] <logmsgbot>	 !log jhathaway@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on mx1001.wikimedia.org with reason: BDAT
[14:58:22] <logmsgbot>	 !log jhathaway@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on mx1001.wikimedia.org with reason: BDAT
[14:58:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:58:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:00:06] <icinga-wm>	 RECOVERY - Check systemd state on gitlab1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:10:56] <icinga-wm>	 PROBLEM - SSH on druid1006.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:15:27] <wikibugs>	 (PS1) Jelto: Revert "gitlab: reduce backup_keep_time to 2d" [puppet] - https://gerrit.wikimedia.org/r/802622
[15:17:56] <icinga-wm>	 RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:20:09] <wikibugs>	 (CR) Jelto: [V: +1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35719/console" [puppet] - https://gerrit.wikimedia.org/r/802622 (owner: Jelto)
[15:20:12] <wikibugs>	 (PS2) Majavah: wmcs: vps: create_instance_with_prefix: unbreak [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/802170
[15:21:17] <wikibugs>	 (CR) Andrew Bogott: [C: +2] wmcs: vps: create_instance_with_prefix: unbreak [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/802170 (owner: Majavah)
[15:22:04] <wikibugs>	 (CR) Andrew Bogott: wmcs: vps: create_instance_with_prefix: unbreak [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/802170 (owner: Majavah)
[15:23:58] <wikibugs>	 (PS3) Majavah: wmcs: vps: create_instance_with_prefix: unbreak [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/802170
[15:24:41] <wikibugs>	 (CR) Majavah: wmcs: vps: create_instance_with_prefix: unbreak (1 comment) [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/802170 (owner: Majavah)
[15:26:57] <wikibugs>	 (CR) Andrew Bogott: wmcs: vps: create_instance_with_prefix: unbreak (1 comment) [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/802170 (owner: Majavah)
[15:30:06] <wikibugs>	 (CR) Majavah: wmcs: vps: create_instance_with_prefix: unbreak (1 comment) [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/802170 (owner: Majavah)
[15:31:40] <wikibugs>	 (PS4) Ahmon Dancy: mwdebug service: Add traindev environment support [deployment-charts] - https://gerrit.wikimedia.org/r/798883
[15:32:36] <wikibugs>	 (PS4) Majavah: wmcs: vps: create_instance_with_prefix: unbreak [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/802170
[15:34:57] <wikibugs>	 (PS5) Ahmon Dancy: mwdebug service: Add traindev environment support [deployment-charts] - https://gerrit.wikimedia.org/r/798883 (https://phabricator.wikimedia.org/T299648)
[15:35:27] <wikibugs>	 (CR) Ahmon Dancy: mwdebug service: Add traindev environment support (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/798883 (https://phabricator.wikimedia.org/T299648) (owner: Ahmon Dancy)
[15:40:48] <dancy>	 jounebot nowandnext
[15:40:51] <dancy>	 jouncebot nowandnext
[15:40:51] <jouncebot>	 For the next 15 hour(s) and 19 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220603T0700)
[15:40:51] <jouncebot>	 In 15 hour(s) and 19 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220604T0700)
[15:43:39] <wikibugs>	 (CR) Ahmon Dancy: [C: +1] docker_registry_ha: Authorize GitLab trusted runners using JWT (1 comment) [puppet] - https://gerrit.wikimedia.org/r/793875 (https://phabricator.wikimedia.org/T308501) (owner: Dduvall)
[15:43:57] <wikibugs>	 (PS9) Jelto: GitLab: enable container registry [puppet] - https://gerrit.wikimedia.org/r/790778 (https://phabricator.wikimedia.org/T307537) (owner: Brennen Bearnes)
[15:49:13] <wikibugs>	 (CR) Jelto: [V: +1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35720/console" [puppet] - https://gerrit.wikimedia.org/r/790778 (https://phabricator.wikimedia.org/T307537) (owner: Brennen Bearnes)
[15:50:06] <icinga-wm>	 RECOVERY - SSH on wtp1048.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:50:32] <wikibugs>	 (CR) Ahmon Dancy: [C: +1] Provide buildkitd to GitLab runners [puppet] - https://gerrit.wikimedia.org/r/791655 (https://phabricator.wikimedia.org/T308271) (owner: Dduvall)
[15:56:58] <wikibugs>	 (CR) Jelto: [V: +1 C: +1] "Technically that looks fine for me now and the new host have a dedicated volume for the registry data." [puppet] - https://gerrit.wikimedia.org/r/790778 (https://phabricator.wikimedia.org/T307537) (owner: Brennen Bearnes)
[16:03:33] <wikibugs>	 SRE, Infrastructure-Foundations, Patch-For-Review: Assign SPDX headers to puppet.git - https://phabricator.wikimedia.org/T308013 (MaxSem)
[16:06:30] <logmsgbot>	 !log herron@cumin1001 END (PASS) - Cookbook sre.kafka.roll-restart-brokers (exit_code=0) for Kafka A:kafka-main-eqiad cluster: Roll restart of jvm daemons.
[16:06:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:10:36] <wikibugs>	 (PS5) Andrew Bogott: wmcs: vps: create_instance_with_prefix: unbreak [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/802170 (owner: Majavah)
[16:11:08] <wikibugs>	 (CR) Andrew Bogott: "updated with the version of this patch that seems to actually run" [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/802170 (owner: Majavah)
[16:11:45] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: restart to enable S3 plugin - bking@cumin1001 - T309720
[16:11:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:11:49] <stashbot>	 T309720: Deploy S3 plugin on all Search team-managed Elastic hosts - https://phabricator.wikimedia.org/T309720
[16:12:06] <icinga-wm>	 RECOVERY - SSH on druid1006.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:12:18] <logmsgbot>	 !log dancy@deploy1002 sync-wikiversions aborted: testing mediawiki container image build and deploy (duration: 00m 11s)
[16:12:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:13:08] <icinga-wm>	 PROBLEM - Disk space on mx1001 is CRITICAL: DISK CRITICAL - /var/spool/exim4/db is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mx1001&var-datasource=eqiad+prometheus/ops
[16:13:24] <logmsgbot>	 !log dancy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[16:13:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:15:53] <logmsgbot>	 !log jhathaway@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on mx1001.wikimedia.org with reason: BDAT
[16:15:54] <logmsgbot>	 !log jhathaway@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on mx1001.wikimedia.org with reason: BDAT
[16:15:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:15:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:17:29] <logmsgbot>	 !log dancy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[16:17:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:19:14] <icinga-wm>	 PROBLEM - Disk space on centrallog2002 is CRITICAL: DISK CRITICAL - free space: /srv 54885 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=centrallog2002&var-datasource=codfw+prometheus/ops
[16:19:31] <logmsgbot>	 !log dancy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[16:19:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:20:08] <logmsgbot>	 !log dancy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[16:20:09] <logmsgbot>	 !log dancy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[16:20:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:20:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:20:25] <logmsgbot>	 !log dancy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[16:20:26] <logmsgbot>	 !log dancy@deploy1002 sync-wikiversions aborted: testing mediawiki container image build and deploy (duration: 07m 07s)
[16:20:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:20:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:23:00] <icinga-wm>	 PROBLEM - Check systemd state on doc1001 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-doc-doc2001.codfw.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:26:16] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[16:26:35] <wikibugs>	 (PS1) Ahmon Dancy: scap.cfg.erb: Set release_repo_update_mediawiki_releases_values_cmd [puppet] - https://gerrit.wikimedia.org/r/802795 (https://phabricator.wikimedia.org/T299648)
[16:46:02] <wikibugs>	 (CR) SBassett: [C: +1] "Thinking about this a bit more, I'm really not seeing this being more than a low risk, in what the proposed CSP policy opens up.  Again, t" [puppet] - https://gerrit.wikimedia.org/r/801776 (https://phabricator.wikimedia.org/T285570) (owner: Catrope)
[16:52:34] <wikibugs>	 (PS1) Ahmon Dancy: mediawiki 0.2.2: Run test job as uid 1000 [deployment-charts] - https://gerrit.wikimedia.org/r/802799
[16:55:34] <icinga-wm>	 RECOVERY - Disk space on mx1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mx1001&var-datasource=eqiad+prometheus/ops
[16:56:55] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Assign SPDX headers to puppet.git - https://phabricator.wikimedia.org/T308013 (10Nemo_bis)
[17:09:53] <wikibugs>	 (03PS2) 10Zabe: raid: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/802564 (https://phabricator.wikimedia.org/T308013)
[17:11:21] <wikibugs>	 (03CR) 10Zabe: raid: Add SPDX headers (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/802564 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe)
[17:12:16] <wikibugs>	 (03CR) 10Ebernhardson: [C: 03+1] [cirrus] Fix typo in config var [mediawiki-config] - 10https://gerrit.wikimedia.org/r/801792 (owner: 10DCausse)
[17:14:10] <icinga-wm>	 RECOVERY - SSH on cp5012.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:14:16] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[17:14:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:14:52] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[17:14:54] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[17:14:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:14:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:18:48] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[17:18:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:18:50] <icinga-wm>	 RECOVERY - Check systemd state on doc1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:33:00] <wikibugs>	 (03PS14) 10JMeybohm: Add a cookbook for rolling reboot of k8s clusters [cookbooks] - 10https://gerrit.wikimedia.org/r/789680 (https://phabricator.wikimedia.org/T260661)
[17:33:02] <wikibugs>	 (03PS1) 10JMeybohm: black format cookbooks/sre/__init__.py [cookbooks] - 10https://gerrit.wikimedia.org/r/802810
[17:33:04] <wikibugs>	 (03PS1) 10JMeybohm: Make SREBatchBase operate on host groups [cookbooks] - 10https://gerrit.wikimedia.org/r/802811
[17:36:19] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Make SREBatchBase operate on host groups [cookbooks] - 10https://gerrit.wikimedia.org/r/802811 (owner: 10JMeybohm)
[17:52:52] <icinga-wm>	 PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:53:34] <wikibugs>	 (03PS2) 10JMeybohm: Make SREBatchBase operate on host groups [cookbooks] - 10https://gerrit.wikimedia.org/r/802811
[17:54:16] <wikibugs>	 (03CR) 10Herron: "Thanks for this!" [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/802646 (https://phabricator.wikimedia.org/T302842) (owner: 10RLazarus)
[17:55:38] <icinga-wm>	 PROBLEM - Number of backend failures per minute from CirrusSearch on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [600.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&viewPanel=9
[17:56:28] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Make SREBatchBase operate on host groups [cookbooks] - 10https://gerrit.wikimedia.org/r/802811 (owner: 10JMeybohm)
[17:58:54] <wikibugs>	 (03CR) 10JMeybohm: Make SREBatchBase operate on host groups (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/802811 (owner: 10JMeybohm)
[18:00:12] <icinga-wm>	 PROBLEM - Number of backend failures per minute from CirrusSearch on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [600.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&viewPanel=9
[18:01:34] <wikibugs>	 (03PS3) 10JMeybohm: Make SREBatchBase operate on host groups [cookbooks] - 10https://gerrit.wikimedia.org/r/802811
[18:01:49] <logmsgbot>	 !log aokoth@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2019.codfw.wmnet
[18:01:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:02:28] <rzl>	 herron: ahahaha that sure didn't work as intended, did it, thanks for the catch
[18:05:35] <rzl>	 (as much as the 1000.00% availability for etcd looks great IMO)
[18:07:53] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] vrts: adjust tests files to renamed role class [puppet] - 10https://gerrit.wikimedia.org/r/802580 (https://phabricator.wikimedia.org/T293942) (owner: 10Dzahn)
[18:07:59] <wikibugs>	 (03PS2) 10Dzahn: vrts: adjust tests files to renamed role class [puppet] - 10https://gerrit.wikimedia.org/r/802580 (https://phabricator.wikimedia.org/T293942)
[18:08:50] <herron>	 rzl ha! tbh it’s great to review these though, I think theres a lot we could standardize/simplify too
[18:09:23] <rzl>	 yeah totally
[18:09:34] <logmsgbot>	 !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2019.codfw.wmnet
[18:09:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:11:12] <wikibugs>	 (03PS2) 10Dzahn: Revert "gitlab: reduce backup_keep_time to 2d" [puppet] - 10https://gerrit.wikimedia.org/r/802622 (owner: 10Jelto)
[18:11:56] <rzl>	 ahhh I see what happened, all that time staring at those parentheses and I still ended up with them in the wrong place -- I'll turn that around later
[18:12:29] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] Revert "gitlab: reduce backup_keep_time to 2d" [puppet] - 10https://gerrit.wikimedia.org/r/802622 (owner: 10Jelto)
[18:18:32] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "fwiw, this triggered a Exec[Reconfigure GitLab] and that takes gitlab down for a moment.. it did come back though right when I started to " [puppet] - 10https://gerrit.wikimedia.org/r/802622 (owner: 10Jelto)
[18:19:49] <wikibugs>	 (03PS4) 10Eevans: WIP: Configure AQS Cassandra hosts [puppet] - 10https://gerrit.wikimedia.org/r/802604 (https://phabricator.wikimedia.org/T307801)
[18:21:54] <icinga-wm>	 PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:29:42] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] Revert "scap.cfg: Enable rsync_cdbs in beta" [puppet] - 10https://gerrit.wikimedia.org/r/801746 (https://phabricator.wikimedia.org/T297326) (owner: 10Ahmon Dancy)
[18:29:47] <wikibugs>	 (03PS3) 10Dzahn: Revert "scap.cfg: Enable rsync_cdbs in beta" [puppet] - 10https://gerrit.wikimedia.org/r/801746 (https://phabricator.wikimedia.org/T297326) (owner: 10Ahmon Dancy)
[18:34:32] <mutante>	 !log deleting expired digicert TLS certs https://gerrit.wikimedia.org/r/c/operations/puppet/+/791678
[18:34:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:37:50] <wikibugs>	 10ops-eqiad: Port with no description on access switch - https://phabricator.wikimedia.org/T309741 (10phaultfinder)
[18:37:51] <wikibugs>	 (03PS5) 10Dzahn: parsoid::testing: move apache/php auto_restarts to separate profile [puppet] - 10https://gerrit.wikimedia.org/r/800245
[18:40:59] <wikibugs>	 (03CR) 10RLazarus: httpbb: Add basic tests for query_service (WDQS) (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/802079 (owner: 10Lucas Werkmeister (WMDE))
[18:42:53] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/pcc-worker1002/35721/testreduce1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/800245 (owner: 10Dzahn)
[18:45:50] <mutante>	 !log testreduce1001 - systemctl reset-failed after gerrit:800245 removed failed auto_restart services for non-existing apache and php services
[18:45:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:47:01] <mutante>	 !log testreduce - re-enabling Icinga notifications that were disabled for unknown reasons
[18:47:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:51:40] <logmsgbot>	 !log jhathaway@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on mx1001.wikimedia.org with reason: BDAT
[18:51:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:51:42] <logmsgbot>	 !log jhathaway@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on mx1001.wikimedia.org with reason: BDAT
[18:51:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:53:37] <wikibugs>	 10SRE: cloudstore1008 - eno2 reporting no carrier - https://phabricator.wikimedia.org/T309885 (10Dzahn)
[18:54:02] <icinga-wm>	 RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:54:59] <wikibugs>	 10SRE, 10Data-Persistence-Backup, 10media-backups, 10Goal, 10Patch-For-Review: Document media recovery use case proposals and decide their priority - https://phabricator.wikimedia.org/T299764 (10Krinkle) >>! In T299764#7813469, @jcrespo wrote: > The main issue I ran into is that it was said it was guaran...
[18:55:58] <wikibugs>	 10SRE: an-tool1005 - memcached Connection refused - https://phabricator.wikimedia.org/T309886 (10Dzahn)
[18:58:17] <wikibugs>	 (03CR) 10Dzahn: "are these files still relevant?" [puppet] - 10https://gerrit.wikimedia.org/r/802579 (https://phabricator.wikimedia.org/T293942) (owner: 10Dzahn)
[19:04:52] <wikibugs>	 (03CR) 10Herron: Add role::netmon to the netmon1003 instance. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/802593 (https://phabricator.wikimedia.org/T309074) (owner: 10Andrea Denisse)
[19:07:51] <wikibugs>	 (03PS1) 10Dzahn: Bug: T307142 Change-Id: I13b8b25f66cf7c384ce464bbbeb9b7a3a7dc3861 [puppet] - 10https://gerrit.wikimedia.org/r/802821 (https://phabricator.wikimedia.org/T307142)
[19:08:53] <wikibugs>	 (03CR) 10Dzahn: "check the mysql privileges with DBA. would be good to know whether the new instance can or can not write to the same DB" [puppet] - 10https://gerrit.wikimedia.org/r/802593 (https://phabricator.wikimedia.org/T309074) (owner: 10Andrea Denisse)
[19:11:08] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Bug: T307142 Change-Id: I13b8b25f66cf7c384ce464bbbeb9b7a3a7dc3861 [puppet] - 10https://gerrit.wikimedia.org/r/802821 (https://phabricator.wikimedia.org/T307142) (owner: 10Dzahn)
[19:11:15] <wikibugs>	 (03PS2) 10Dzahn: logtash: replace gitlab1001 with gitlab1004 in tests [puppet] - 10https://gerrit.wikimedia.org/r/802821 (https://phabricator.wikimedia.org/T307142)
[19:14:25] <wikibugs>	 (03PS1) 10Dzahn: gitlab/acme_chief: remove gitlab1001 from list of (passive) hosts [puppet] - 10https://gerrit.wikimedia.org/r/802822 (https://phabricator.wikimedia.org/T307142)
[19:16:03] <wikibugs>	 (03PS1) 10Dzahn: gitlab::dump: delete class [puppet] - 10https://gerrit.wikimedia.org/r/802823 (https://phabricator.wikimedia.org/T274463)
[19:16:23] <wikibugs>	 (03PS51) 10Raymond Ndibe: Create REST api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040)
[19:16:44] <wikibugs>	 (03PS2) 10Dzahn: gitlab::dump: delete role and profile classes [puppet] - 10https://gerrit.wikimedia.org/r/802823 (https://phabricator.wikimedia.org/T274463)
[19:16:48] <wikibugs>	 (03CR) 10Herron: [C: 03+1] wmcs: Added taskircmail, ircmail and pagetaskircmail routings [puppet] - 10https://gerrit.wikimedia.org/r/802040 (owner: 10David Caro)
[19:17:42] <wikibugs>	 (03PS1) 10Dzahn: DHCP: remove gitlab1001 [puppet] - 10https://gerrit.wikimedia.org/r/802824 (https://phabricator.wikimedia.org/T307142)
[19:19:53] <wikibugs>	 (03PS1) 10Dzahn: site: remove gitlab1001, adjust gitlab machine descriptions [puppet] - 10https://gerrit.wikimedia.org/r/802846 (https://phabricator.wikimedia.org/T307142)
[19:20:27] <wikibugs>	 (03CR) 10Herron: Add role::netmon to the netmon1003 instance. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/802593 (https://phabricator.wikimedia.org/T309074) (owner: 10Andrea Denisse)
[19:20:43] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118 (T298560)', diff saved to https://phabricator.wikimedia.org/P29379 and previous config saved to /var/cache/conftool/dbconfig/20220603-192042-ladsgroup.json
[19:20:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:20:47] <stashbot>	 T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560
[19:22:27] <wikibugs>	 (03PS1) 10Dzahn: site: remove gitlab_dump role from gitlab2002 [puppet] - 10https://gerrit.wikimedia.org/r/802847 (https://phabricator.wikimedia.org/T274463)
[19:23:03] <wikibugs>	 (03PS2) 10Dzahn: site: remove gitlab_dump role from gitlab2002 [puppet] - 10https://gerrit.wikimedia.org/r/802847 (https://phabricator.wikimedia.org/T274463)
[19:23:08] <icinga-wm>	 RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:24:05] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] site: remove gitlab_dump role from gitlab2002 [puppet] - 10https://gerrit.wikimedia.org/r/802847 (https://phabricator.wikimedia.org/T274463) (owner: 10Dzahn)
[19:26:37] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "Filebucketed /etc/ferm/conf.d/10_bacula-file-daemon-backup1001.eqiad.wmnet to puppet ..etc we had backup::host on this but not a fileset." [puppet] - 10https://gerrit.wikimedia.org/r/802847 (https://phabricator.wikimedia.org/T274463) (owner: 10Dzahn)
[19:29:37] <mutante>	 !log gitlab2002 - stop rsync service, apt-get remove --purge rsync, delete /etc/rsync.d/ and /etc/rsyncd.conf - after gerrit:802847 T274463
[19:29:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:29:42] <stashbot>	 T274463: Backups for GitLab - https://phabricator.wikimedia.org/T274463
[19:30:39] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] "not used after https://gerrit.wikimedia.org/r/c/operations/puppet/+/802847" [puppet] - 10https://gerrit.wikimedia.org/r/802823 (https://phabricator.wikimedia.org/T274463) (owner: 10Dzahn)
[19:34:19] <wikibugs>	 (03PS1) 10Dzahn: vrts: rename daemon resource and template from otrs to vrts [puppet] - 10https://gerrit.wikimedia.org/r/802849 (https://phabricator.wikimedia.org/T293942)
[19:35:48] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118', diff saved to https://phabricator.wikimedia.org/P29380 and previous config saved to /var/cache/conftool/dbconfig/20220603-193547-ladsgroup.json
[19:35:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:36:30] <wikibugs>	 (03PS1) 10Dzahn: vrts: rename systemd timer to train spamassassin [puppet] - 10https://gerrit.wikimedia.org/r/802850 (https://phabricator.wikimedia.org/T293942)
[19:37:17] <wikibugs>	 (03CR) 10Cwhite: "These files are test fixtures that have no effect on production.  This change by itself is safe." [puppet] - 10https://gerrit.wikimedia.org/r/802821 (https://phabricator.wikimedia.org/T307142) (owner: 10Dzahn)
[19:39:23] <wikibugs>	 (03PS1) 10Dzahn: vrts: rename exim4 templates from otrs to vrts [puppet] - 10https://gerrit.wikimedia.org/r/802851 (https://phabricator.wikimedia.org/T293942)
[19:40:06] <wikibugs>	 (03CR) 10Dzahn: "makes sense. thank you!. I would like to merge it anyways just to clean up occurences of the machine name then." [puppet] - 10https://gerrit.wikimedia.org/r/802821 (https://phabricator.wikimedia.org/T307142) (owner: 10Dzahn)
[19:44:47] <wikibugs>	 (03PS1) 10Dzahn: vrts: rename TicketExport2Mbox file [puppet] - 10https://gerrit.wikimedia.org/r/802852 (https://phabricator.wikimedia.org/T293942)
[19:47:29] <wikibugs>	 (03PS1) 10Dzahn: vrts: rename ferm services from otrs to vrts [puppet] - 10https://gerrit.wikimedia.org/r/802853 (https://phabricator.wikimedia.org/T293942)
[19:50:53] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118', diff saved to https://phabricator.wikimedia.org/P29381 and previous config saved to /var/cache/conftool/dbconfig/20220603-195052-ladsgroup.json
[19:50:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:51:08] <wikibugs>	 (03PS1) 10Dzahn: mx: rename OTRS database related variables [puppet] - 10https://gerrit.wikimedia.org/r/802854 (https://phabricator.wikimedia.org/T293942)
[19:54:16] <wikibugs>	 (03PS2) 10Dzahn: mx: rename OTRS database related variables [puppet] - 10https://gerrit.wikimedia.org/r/802854 (https://phabricator.wikimedia.org/T293942)
[19:54:30] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] logtash: replace gitlab1001 with gitlab1004 in tests [puppet] - 10https://gerrit.wikimedia.org/r/802821 (https://phabricator.wikimedia.org/T307142) (owner: 10Dzahn)
[19:55:09] <wikibugs>	 (03PS10) 10Brennen Bearnes: GitLab: enable container registry [puppet] - 10https://gerrit.wikimedia.org/r/790778 (https://phabricator.wikimedia.org/T307537)
[20:05:58] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118 (T298560)', diff saved to https://phabricator.wikimedia.org/P29382 and previous config saved to /var/cache/conftool/dbconfig/20220603-200557-ladsgroup.json
[20:06:00] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1134.eqiad.wmnet with reason: Maintenance
[20:06:01] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1134.eqiad.wmnet with reason: Maintenance
[20:06:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:06:03] <stashbot>	 T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560
[20:06:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:06:06] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1134 (T298560)', diff saved to https://phabricator.wikimedia.org/P29383 and previous config saved to /var/cache/conftool/dbconfig/20220603-200606-ladsgroup.json
[20:06:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:06:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:20:00] <wikibugs>	 (03CR) 10Jdlrobson: [C: 04-1] Remove 6 deprecated ResourceLoader skin modules in core (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/802578 (https://phabricator.wikimedia.org/T304322) (owner: 10Mabualruz)
[20:20:42] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Assign SPDX headers to puppet.git - https://phabricator.wikimedia.org/T308013 (10Dzahn) > bundle exec rake 'spdx:convert:module[MODULENAME]'  Is there any way to install the ruby gem "puppet" from a Debian package?  `Could not find gem 'puppet (= 5.5.1...
[20:26:16] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[21:10:16] <wikibugs>	 (03PS1) 10Cwhite: opensearch: add support for managing opensearch 2.0 [puppet] - 10https://gerrit.wikimedia.org/r/802862 (https://phabricator.wikimedia.org/T304440)
[21:10:18] <wikibugs>	 (03PS1) 10Cwhite: beta-logs: change opensearch version to 2.0 [puppet] - 10https://gerrit.wikimedia.org/r/802863 (https://phabricator.wikimedia.org/T304440)
[21:34:24] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141 (T298560)', diff saved to https://phabricator.wikimedia.org/P29384 and previous config saved to /var/cache/conftool/dbconfig/20220603-213423-ladsgroup.json
[21:34:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:34:28] <stashbot>	 T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560
[21:36:31] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: restart to enable S3 plugin - bking@cumin1001 - T309720
[21:36:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:36:35] <stashbot>	 T309720: Deploy S3 plugin on all Search team-managed Elastic hosts - https://phabricator.wikimedia.org/T309720
[21:41:39] <wikibugs>	 (03PS1) 10Brennen Bearnes: tag-release.sh: add some logging, more rigorous tag push [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/802868
[21:42:12] <wikibugs>	 (03PS1) 10Brennen Bearnes: tag-release.sh: add some logging, more rigorous tag push [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/802869
[21:43:46] <wikibugs>	 (03Abandoned) 10Brennen Bearnes: tag-release.sh: add some logging, more rigorous tag push [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/802868 (owner: 10Brennen Bearnes)
[21:49:29] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P29385 and previous config saved to /var/cache/conftool/dbconfig/20220603-214928-ladsgroup.json
[21:49:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:04:34] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P29386 and previous config saved to /var/cache/conftool/dbconfig/20220603-220433-ladsgroup.json
[22:04:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:19:39] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141 (T298560)', diff saved to https://phabricator.wikimedia.org/P29387 and previous config saved to /var/cache/conftool/dbconfig/20220603-221938-ladsgroup.json
[22:19:40] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[22:19:42] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[22:19:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:19:45] <stashbot>	 T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560
[22:19:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:19:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:07:48] <wikibugs>	 (03PS4) 10RLazarus: slo: Correct queries for error budget remaining [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/802646 (https://phabricator.wikimedia.org/T302842)
[23:20:33] <wikibugs>	 (03CR) 10RLazarus: slo: Correct queries for error budget remaining (034 comments) [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/802646 (https://phabricator.wikimedia.org/T302842) (owner: 10RLazarus)
[23:24:21] <wikibugs>	 (03PS1) 10Cwhite: add new index pattern format [software/ecs] - 10https://gerrit.wikimedia.org/r/802873 (https://phabricator.wikimedia.org/T305175)
[23:53:50] <wikibugs>	 (03PS1) 10Dzahn: vrts: delete idle_agent_report [puppet] - 10https://gerrit.wikimedia.org/r/802876 (https://phabricator.wikimedia.org/T293942)
[23:54:44] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] vrts: delete idle_agent_report [puppet] - 10https://gerrit.wikimedia.org/r/802876 (https://phabricator.wikimedia.org/T293942) (owner: 10Dzahn)
[23:55:00] <wikibugs>	 (03CR) 10Dzahn: "https://puppet-compiler.wmflabs.org/pcc-worker1002/35722/otrs1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/802849 (https://phabricator.wikimedia.org/T293942) (owner: 10Dzahn)
[23:56:34] <wikibugs>	 (03PS2) 10Dzahn: vrts: delete idle_agent_report [puppet] - 10https://gerrit.wikimedia.org/r/802876 (https://phabricator.wikimedia.org/T293942)
[23:59:16] <wikibugs>	 (03CR) 10Dzahn: "https://puppet-compiler.wmflabs.org/pcc-worker1003/35723/otrs1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/802850 (https://phabricator.wikimedia.org/T293942) (owner: 10Dzahn)