[00:01:58] PROBLEM - Maps tiles generation on alert1001 is CRITICAL: CRITICAL: 100.00% of data under the critical threshold [5.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/d/000000305/maps-performances?orgId=1&viewPanel=8
[00:20:50] (PS1) Tim Starling: Add MariaDB grants for x2 [puppet] - https://gerrit.wikimedia.org/r/802669
[00:26:16] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[00:29:47] RECOVERY - SSH on wtp1038.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:31:38] SRE, ops-eqiad: eqiad: Move links to new MPC7E linecard - https://phabricator.wikimedia.org/T304712 (Papaul) @ayounsi looks good to me
[00:34:01] RECOVERY - Disk space on centrallog2002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=centrallog2002&var-datasource=codfw+prometheus/ops
[00:36:35] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host clouddumps1001.wikimedia.org with OS bullseye
[00:36:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:36:45] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations, and 3 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host clouddumps1001.wikimedia.org with...
[00:43:15] (PS2) Tim Starling: Add MariaDB grants for x2 [puppet] - https://gerrit.wikimedia.org/r/802669
[00:52:41] (PS3) Tim Starling: [WIP] Add MariaDB grants for x2 [puppet] - https://gerrit.wikimedia.org/r/802669
[01:00:53] PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:05:04] (PS4) Tim Starling: [WIP] Add MariaDB grants for x2 [puppet] - https://gerrit.wikimedia.org/r/802669
[01:08:02] (PS5) Tim Starling: Add MariaDB grants for x2 [puppet] - https://gerrit.wikimedia.org/r/802669
[01:09:53] PROBLEM - Check systemd state on gitlab1001 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-config-backup-gitlab1004.wikimedia.org.service,rsync-data-backup-gitlab1004.wikimedia.org.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:10:25] (PS6) Tim Starling: Add MariaDB grants for x2 [puppet] - https://gerrit.wikimedia.org/r/802669 (https://phabricator.wikimedia.org/T212129)
[01:12:02] !log bking@cumin1001 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: restart to enable S3 plugin - bking@cumin1001 - T309720
[01:12:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:12:08] T309720: Deploy S3 plugin on all Search team-managed Elastic hosts - https://phabricator.wikimedia.org/T309720
[01:14:11] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[01:15:01] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:15:30] (CR) Tim Starling: "I
tested it with the puppet compiler, confirming that no change is made to servers that are not in x2. https://puppet-compiler.wmflabs.org" [puppet] - https://gerrit.wikimedia.org/r/802669 (https://phabricator.wikimedia.org/T212129) (owner: Tim Starling)
[01:20:34] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1121.eqiad.wmnet with reason: Maintenance
[01:20:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1121.eqiad.wmnet with reason: Maintenance
[01:20:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:20:37] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[01:20:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:20:40] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[01:20:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:20:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:20:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1121 (T298560)', diff saved to https://phabricator.wikimedia.org/P29365 and previous config saved to /var/cache/conftool/dbconfig/20220603-012045-ladsgroup.json
[01:20:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:20:49] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560
[01:26:56] (CR) MacFan4000: "This change is ready for review."
[mediawiki-config] - https://gerrit.wikimedia.org/r/802612 (owner: MacFan4000)
[01:49:02] (CR) Tim Starling: [C: +2] Add MariaDB grants for x2 [puppet] - https://gerrit.wikimedia.org/r/802669 (https://phabricator.wikimedia.org/T212129) (owner: Tim Starling)
[01:54:21] !log on db1151 (x2), created mainstash database and applied suitable grants
[01:54:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:31:46] SRE, MediaWiki-General, Performance-Team, serviceops-radar, and 5 others: Move MainStash out of Redis to a simpler multi-dc aware solution - https://phabricator.wikimedia.org/T212129 (tstarling) I created the database and table, applied the grants, and tested it from eval.php, testing the wikiadm...
[02:32:43] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:42:23] (PS1) BryanDavis: proxy: horrible hack for T309821 [software/tools-webservice] - https://gerrit.wikimedia.org/r/802683 (https://phabricator.wikimedia.org/T309821)
[02:45:12] (PS2) BryanDavis: proxy: horrible hack for T309821 [software/tools-webservice] - https://gerrit.wikimedia.org/r/802683 (https://phabricator.wikimedia.org/T309821)
[02:47:37] (CR) BryanDavis: "This is disgusting mostly because I have no idea what changed that is making the response be empty now."
[software/tools-webservice] - https://gerrit.wikimedia.org/r/802683 (https://phabricator.wikimedia.org/T309821) (owner: BryanDavis)
[02:53:44] (PS1) BryanDavis: d/changelog: Prepare for 0.85 release [software/tools-webservice] - https://gerrit.wikimedia.org/r/802684 (https://phabricator.wikimedia.org/T309821)
[02:54:09] (CR) BryanDavis: [C: +2] proxy: horrible hack for T309821 [software/tools-webservice] - https://gerrit.wikimedia.org/r/802683 (https://phabricator.wikimedia.org/T309821) (owner: BryanDavis)
[02:55:07] (CR) CI reject: [V: -1] d/changelog: Prepare for 0.85 release [software/tools-webservice] - https://gerrit.wikimedia.org/r/802684 (https://phabricator.wikimedia.org/T309821) (owner: BryanDavis)
[02:55:10] (Merged) jenkins-bot: proxy: horrible hack for T309821 [software/tools-webservice] - https://gerrit.wikimedia.org/r/802683 (https://phabricator.wikimedia.org/T309821) (owner: BryanDavis)
[03:00:00] (PS2) BryanDavis: d/changelog: Prepare for 0.85 release [software/tools-webservice] - https://gerrit.wikimedia.org/r/802684 (https://phabricator.wikimedia.org/T309821)
[03:01:38] (CR) BryanDavis: [C: +2] d/changelog: Prepare for 0.85 release [software/tools-webservice] - https://gerrit.wikimedia.org/r/802684 (https://phabricator.wikimedia.org/T309821) (owner: BryanDavis)
[03:02:33] (PS1) Samwilson: Update $wgVectorMaxWidthOptions to include action=edit [mediawiki-config] - https://gerrit.wikimedia.org/r/802685 (https://phabricator.wikimedia.org/T307725)
[03:03:19] RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:03:22] (Merged) jenkins-bot: d/changelog: Prepare for 0.85 release [software/tools-webservice] - https://gerrit.wikimedia.org/r/802684 (https://phabricator.wikimedia.org/T309821) (owner: BryanDavis)
[03:06:55] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL
- Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:09:03] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.315 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:33:51] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:57:43] PROBLEM - SSH on druid1006.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:24:13] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[04:26:16] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[04:33:39] PROBLEM - Backup freshness on backup1001 is CRITICAL: All failures: 1 (gitlab1004), Fresh: 115 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[04:56:32] SRE, MediaWiki-General, Performance-Team, serviceops-radar, and 5 others: Move MainStash out of Redis to a simpler multi-dc aware solution - https://phabricator.wikimedia.org/T212129 (Marostegui) I'm fully alone next week as the rest of the team is gone, can we enable this the following week inst...
[05:19:01] !log Stop mysql on db1128 for on-site maintenance T309291
[05:19:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:19:09] T309291: db1128 faulty memory - https://phabricator.wikimedia.org/T309291
[05:20:38] SRE, ops-eqiad: db1128 faulty memory - https://phabricator.wikimedia.org/T309291 (Marostegui) @Cmjohnson db1128 is now off and ready for you to change its DIMM anytime. Once done please bring it back and I will start mysql etc. Thanks!
[05:21:57] (PS1) Marostegui: db1128: Add it to dbctl [puppet] - https://gerrit.wikimedia.org/r/802689 (https://phabricator.wikimedia.org/T309303)
[05:23:06] SRE, Infrastructure-Foundations, Parsoid: Retire the old Parsoid deb repository? - https://phabricator.wikimedia.org/T309765 (MoritzMuehlenhoff) Ack, sounds good. I'll wait for a week and if there are no objections, the repo will be removed from the release* hosts.
[05:23:48] SRE, LDAP-Access-Requests: Add Evelien WMDE to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T309700 (MoritzMuehlenhoff) Thanks, we'll proceed as soon as the NDA is wrapped up.
[05:25:42] (CR) Marostegui: [C: +2] db1128: Add it to dbctl [puppet] - https://gerrit.wikimedia.org/r/802689 (https://phabricator.wikimedia.org/T309303) (owner: Marostegui)
[05:28:25] PROBLEM - Check systemd state on ms-be1040 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:34:55] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 116 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[05:53:31] RECOVERY - Check systemd state on ms-be1040 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:30:24] SRE, MediaWiki-General, Performance-Team, serviceops-radar, and 5 others: Move MainStash out of Redis to a simpler multi-dc aware solution - https://phabricator.wikimedia.org/T212129 (tstarling) OK, how about June 14, 05:00 UTC?
[06:42:45] (CR) Joal: [C: -1] "Missing tables in the first bit of the split - Otherwise good! thanks Dan" [puppet] - https://gerrit.wikimedia.org/r/802598 (https://phabricator.wikimedia.org/T309806) (owner: Milimetric)
[06:47:53] (CR) Muehlenhoff: [C: +2] "Thanks! Merging" [puppet] - https://gerrit.wikimedia.org/r/802567 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[06:47:59] (PS2) Muehlenhoff: puppet_stastd: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/802567 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[06:53:37] (CR) Muehlenhoff: [C: +2] netops: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/802572 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[06:53:45] (CR) Muehlenhoff: [C: +2] "Thanks!
Merging" [puppet] - https://gerrit.wikimedia.org/r/802572 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[06:53:52] (PS2) Muehlenhoff: netops: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/802572 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[06:57:29] (PS2) Muehlenhoff: query_service: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/802566 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220603T0700)
[07:00:30] (CR) Muehlenhoff: [C: +1] "Looks good, adding Search people for a second pass." [puppet] - https://gerrit.wikimedia.org/r/802566 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[07:03:13] (PS2) Muehlenhoff: poolcounter: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/802569 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[07:07:38] (CR) Muehlenhoff: [C: +2] "Thanks! Merging" [puppet] - https://gerrit.wikimedia.org/r/802569 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[07:08:27] PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:11:43] (CR) Muehlenhoff: [C: +2] "Thanks!
Merging" [puppet] - https://gerrit.wikimedia.org/r/802571 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[07:11:49] (PS2) Muehlenhoff: nftables: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/802571 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[07:16:46] !log imported scap 4.8.2 to stretch-/buster-/bullseye-wikimedia - T309116
[07:16:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:16:51] T309116: Deploy Scap version 4.8.2 - https://phabricator.wikimedia.org/T309116
[07:20:57] !log jayme@deploy1002 Started deploy [restbase/deploy@6e39559] (dev-cluster): (no justification provided)
[07:20:57] (CR) Muehlenhoff: [C: +2] "Thanks! Merging" [puppet] - https://gerrit.wikimedia.org/r/802568 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[07:20:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:21:05] (PS2) Muehlenhoff: presto: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/802568 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[07:28:48] (CR) Muehlenhoff: [C: +2] "Thanks! Merging" [puppet] - https://gerrit.wikimedia.org/r/802570 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[07:28:54] (PS2) Muehlenhoff: pontoon: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/802570 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[07:33:01] SRE, Release-Engineering-Team, Scap, serviceops: Deploy Scap version 4.8.2 - https://phabricator.wikimedia.org/T309116 (JMeybohm) Deployed to canaries, scap pull looks fine
[07:33:35] !log jayme@deploy1002 Finished deploy [restbase/deploy@6e39559] (dev-cluster): (no justification provided) (duration: 12m 38s)
[07:33:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:37:25] (CR) Muehlenhoff: [C: +2] "Thanks!
Merging" [puppet] - https://gerrit.wikimedia.org/r/802565 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[07:37:32] (PS2) Muehlenhoff: rabbitmq: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/802565 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[07:39:07] (PS2) Slyngshede: kubeadm: drop support for 1.20 [puppet] - https://gerrit.wikimedia.org/r/802143 (owner: Majavah)
[07:39:09] (PS2) Slyngshede: aptrepo: add thirdparty/kubeadm-k8s-1-22 [puppet] - https://gerrit.wikimedia.org/r/802146 (https://phabricator.wikimedia.org/T286856) (owner: Majavah)
[07:39:38] (CR) Slyngshede: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35714/console" [puppet] - https://gerrit.wikimedia.org/r/792109 (https://phabricator.wikimedia.org/T273673) (owner: Slyngshede)
[07:42:13] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:42:25] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:44:19] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.296 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:44:33] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48248 bytes in 0.100 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:52:34] (CR) Muehlenhoff: raid: Add SPDX headers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/802564 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[07:53:24] (CR) Muehlenhoff: raid: Add SPDX headers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/802564 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[07:55:35] PROBLEM - SSH on wtp1039.mgmt is CRITICAL:
CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:03:45] PROBLEM - SSH on cp5012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:09:25] (PS1) Muehlenhoff: Remove Puppet references to idp-test1001/idp-test2001 [puppet] - https://gerrit.wikimedia.org/r/802729 (https://phabricator.wikimedia.org/T308214)
[08:09:35] RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:10:06] (CR) DCausse: [C: +1] query_service: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/802566 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[08:15:16] (CR) Muehlenhoff: [C: +2] Remove Puppet references to idp-test1001/idp-test2001 [puppet] - https://gerrit.wikimedia.org/r/802729 (https://phabricator.wikimedia.org/T308214) (owner: Muehlenhoff)
[08:17:56] (PS3) Muehlenhoff: query_service: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/802566 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[08:20:37] (CR) Muehlenhoff: [C: +2] query_service: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/802566 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[08:26:16] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[08:37:22] (PS3) Slyngshede: kubeadm: drop support for 1.20 [puppet] - https://gerrit.wikimedia.org/r/802143 (owner: Majavah)
[08:37:24] (PS3) Slyngshede: aptrepo: add thirdparty/kubeadm-k8s-1-22 [puppet] -
https://gerrit.wikimedia.org/r/802146 (https://phabricator.wikimedia.org/T286856) (owner: Majavah)
[08:37:38] (CR) Slyngshede: [V: +2] aptrepo: add thirdparty/kubeadm-k8s-1-22 [puppet] - https://gerrit.wikimedia.org/r/802146 (https://phabricator.wikimedia.org/T286856) (owner: Majavah)
[08:45:39] (CR) Slyngshede: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/802531 (https://phabricator.wikimedia.org/T309786) (owner: David Caro)
[08:49:35] SRE, Gerrit: Icinga Check SSL might have a time based race condition - https://phabricator.wikimedia.org/T308908 (hashar) Open→Resolved a:dancy @dancy solved it by restarting Apache2. The root cause is somewhere in Apache 2 and is tracked by T293826
[08:52:50] SRE, Traffic, observability, Upstream: flapping icinga Letsencrypt TLS cert alerts around renewal time - https://phabricator.wikimedia.org/T293826 (hashar)
[08:56:56] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[08:56:58] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[08:56:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:57:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:58:13] !log jnuche@deploy1002 install-world aborted: (duration: 00m 03s)
[08:58:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:58:47] (PS8) Vgutierrez: [WIP] esitest service [puppet] - https://gerrit.wikimedia.org/r/793561 (https://phabricator.wikimedia.org/T308799) (owner: BBlack)
[09:00:02] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:00:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:06:00] (PS9) Vgutierrez: [WIP] esitest service [puppet] - https://gerrit.wikimedia.org/r/793561 (https://phabricator.wikimedia.org/T308799) (owner: BBlack)
[09:11:13] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts
idp-test2001.wikimedia.org
[09:11:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:14:12] (CR) Lucas Werkmeister (WMDE): httpbb: Add basic tests for query_service (WDQS, WCQS) (1 comment) [puppet] - https://gerrit.wikimedia.org/r/802079 (owner: Lucas Werkmeister (WMDE))
[09:14:22] (PS3) Lucas Werkmeister (WMDE): httpbb: Add basic tests for query_service (WDQS) [puppet] - https://gerrit.wikimedia.org/r/802079
[09:14:24] (PS4) Lucas Werkmeister (WMDE): query_service: don’t cache index files [puppet] - https://gerrit.wikimedia.org/r/799297 (https://phabricator.wikimedia.org/T289243)
[09:15:04] !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[09:15:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:20:21] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:20:22] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts idp-test2001.wikimedia.org
[09:20:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:20:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:20:29] SRE, Infrastructure-Foundations, Patch-For-Review: Migrate the IDPs to Bullseye - https://phabricator.wikimedia.org/T308214 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `idp-test2001.wikimedia.org` - idp-test2001.wikimedia.org (**PASS**) - Downtimed hos...
[09:21:07] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts idp-test1001.wikimedia.org
[09:21:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:24:25] (PS16) Jbond: prometheus::blackbox::check: add new blackbox exporter check [puppet] - https://gerrit.wikimedia.org/r/787067
[09:24:59] !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[09:25:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:25:48] (CR) WMDE-Fisch: "This change is ready for review." [mediawiki-config] - https://gerrit.wikimedia.org/r/802732 (https://phabricator.wikimedia.org/T307348) (owner: WMDE-Fisch)
[09:28:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:28:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts idp-test1001.wikimedia.org
[09:28:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:28:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:28:38] SRE, Infrastructure-Foundations, Patch-For-Review: Migrate the IDPs to Bullseye - https://phabricator.wikimedia.org/T308214 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `idp-test1001.wikimedia.org` - idp-test1001.wikimedia.org (**PASS**) - Downtimed hos...
[09:32:26] (CR) Ayounsi: Add role::netmon to the netmon1003 instance.
(1 comment) [puppet] - https://gerrit.wikimedia.org/r/802593 (https://phabricator.wikimedia.org/T309074) (owner: Andrea Denisse)
[09:39:21] (PS17) Jbond: prometheus::blackbox::check: add new blackbox exporter check [puppet] - https://gerrit.wikimedia.org/r/787067
[09:41:00] (CR) Vgutierrez: [WIP] esitest service (1 comment) [puppet] - https://gerrit.wikimedia.org/r/793561 (https://phabricator.wikimedia.org/T308799) (owner: BBlack)
[09:42:59] (PS11) David Caro: wmcs: Added taskircmail, ircmail and pagetaskircmail routings [puppet] - https://gerrit.wikimedia.org/r/802040
[09:43:01] (CR) David Caro: wmcs: Added taskircmail, ircmail and pagetaskircmail routings (1 comment) [puppet] - https://gerrit.wikimedia.org/r/802040 (owner: David Caro)
[09:43:06] (CR) David Caro: wmcs: Added taskircmail, ircmail and pagetaskircmail routings (1 comment) [puppet] - https://gerrit.wikimedia.org/r/802040 (owner: David Caro)
[09:43:20] (CR) Ayounsi: "I haven't look at the code but I think you tested it with the ulsfo cluster on netbox-next."
[software/netbox-extras] - https://gerrit.wikimedia.org/r/802179 (https://phabricator.wikimedia.org/T262446) (owner: Volans)
[09:45:24] (CR) David Caro: [C: +2] ceph: filter out also dbgsym packages [puppet] - https://gerrit.wikimedia.org/r/802531 (https://phabricator.wikimedia.org/T309786) (owner: David Caro)
[09:53:42] (PS1) David Caro: ceph: fix regex to match dbg/dbgsym [puppet] - https://gerrit.wikimedia.org/r/802736 (https://phabricator.wikimedia.org/T309786)
[09:54:36] (CR) Jbond: prometheus::blackbox::check: add new blackbox exporter check (1 comment) [puppet] - https://gerrit.wikimedia.org/r/787067 (owner: Jbond)
[09:55:19] (PS18) Jbond: prometheus::blackbox::check: add new blackbox exporter check [puppet] - https://gerrit.wikimedia.org/r/787067
[09:56:42] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[10:00:06] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35715/console" [puppet] - https://gerrit.wikimedia.org/r/787067 (owner: Jbond)
[10:01:45] SRE, MediaWiki-General, Performance-Team, serviceops-radar, and 5 others: Move MainStash out of Redis to a simpler multi-dc aware solution - https://phabricator.wikimedia.org/T212129 (Marostegui) >>! In T212129#7978237, @tstarling wrote: > OK, how about June 14, 05:00 UTC? That would work. I wil...
[10:03:55] SRE, LDAP-Access-Requests: Grant Access to wmf for Lgaulia - https://phabricator.wikimedia.org/T309844 (larissagaulia)
[10:05:22] RECOVERY - SSH on cp5012.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:05:56] (PS3) Lucas Werkmeister (WMDE): Refresh English Wikipedia logo file (enwiki.png) [mediawiki-config] - https://gerrit.wikimedia.org/r/801405 (https://phabricator.wikimedia.org/T309544)
[10:06:37] (CR) Lucas Werkmeister (WMDE): "Scheduled for deployment on Monday." [mediawiki-config] - https://gerrit.wikimedia.org/r/801405 (https://phabricator.wikimedia.org/T309544) (owner: Lucas Werkmeister (WMDE))
[10:08:40] (PS1) Jbond: P:sretest: DO NOT MERGE - test prometheus::blackbox::check::http define [puppet] - https://gerrit.wikimedia.org/r/802737
[10:08:54] (PS10) Vgutierrez: [WIP] esitest service [puppet] - https://gerrit.wikimedia.org/r/793561 (https://phabricator.wikimedia.org/T308799) (owner: BBlack)
[10:09:34] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35716/console" [puppet] - https://gerrit.wikimedia.org/r/802737 (owner: Jbond)
[10:13:09] (CR) Lucas Werkmeister (WMDE): Refresh English Wikipedia logo file (enwiki.png) (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/801405 (https://phabricator.wikimedia.org/T309544) (owner: Lucas Werkmeister (WMDE))
[10:17:40] (CR) Ayounsi: "I think it would be better for consistency to use the 10/8 prod realm loopback.
We could potentially add separate entries for the VRF loop" [puppet] - https://gerrit.wikimedia.org/r/802499 (https://phabricator.wikimedia.org/T304989) (owner: Cathal Mooney)
[10:21:54] (PS11) Vgutierrez: [WIP] esitest service [puppet] - https://gerrit.wikimedia.org/r/793561 (https://phabricator.wikimedia.org/T308799) (owner: BBlack)
[10:38:41] (PS1) Muehlenhoff: remove stray webperf entry in site.pp [puppet] - https://gerrit.wikimedia.org/r/802746
[10:41:01] (CR) Muehlenhoff: [C: +2] remove stray webperf entry in site.pp [puppet] - https://gerrit.wikimedia.org/r/802746 (owner: Muehlenhoff)
[10:53:50] (PS1) Muehlenhoff: Enable webperf1004/2004 as new Arclamp hosts [puppet] - https://gerrit.wikimedia.org/r/802749
[10:53:52] (PS1) Muehlenhoff: Point active arclamp host to webperf1004 and update dsh groups [puppet] - https://gerrit.wikimedia.org/r/802750 (https://phabricator.wikimedia.org/T305460)
[10:55:16] PROBLEM - Check systemd state on sretest1001 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_ulogd2.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:57:05] ^ that's me fixing up sretest1001
[10:58:18] RECOVERY - SSH on wtp1039.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:59:54] RECOVERY - Check systemd state on sretest1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:10:52] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:10:52] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:12:24] PROBLEM - Check systemd state on ms-be2052 is CRITICAL: CRITICAL - degraded: The following units failed:
ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:13:02] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48248 bytes in 0.074 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:13:02] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.322 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:13:02] PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:14:34] (03PS1) 10Muehlenhoff: arclamp: add rsync config to migrate Xenon data [puppet] - 10https://gerrit.wikimedia.org/r/802752 (https://phabricator.wikimedia.org/T305460) [11:15:47] (03CR) 10CI reject: [V: 04-1] arclamp: add rsync config to migrate Xenon data [puppet] - 10https://gerrit.wikimedia.org/r/802752 (https://phabricator.wikimedia.org/T305460) (owner: 10Muehlenhoff) [11:18:25] (03PS2) 10Muehlenhoff: arclamp: add rsync config to migrate Xenon data [puppet] - 10https://gerrit.wikimedia.org/r/802752 (https://phabricator.wikimedia.org/T305460) [11:19:19] (03CR) 10CI reject: [V: 04-1] arclamp: add rsync config to migrate Xenon data [puppet] - 10https://gerrit.wikimedia.org/r/802752 (https://phabricator.wikimedia.org/T305460) (owner: 10Muehlenhoff) [11:22:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121 (T298560)', diff saved to https://phabricator.wikimedia.org/P29366 and previous config saved to /var/cache/conftool/dbconfig/20220603-112234-ladsgroup.json [11:22:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:40] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560 [11:27:48] (03PS3) 10Muehlenhoff: arclamp: add rsync config to migrate Xenon data [puppet] - 10https://gerrit.wikimedia.org/r/802752 
(https://phabricator.wikimedia.org/T305460) [11:36:21] 10SRE, 10Performance-Team, 10Patch-For-Review: Upgrade webperf hosts to Bullseye - https://phabricator.wikimedia.org/T305460 (10MoritzMuehlenhoff) [11:37:34] RECOVERY - Check systemd state on ms-be2052 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:37:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121', diff saved to https://phabricator.wikimedia.org/P29367 and previous config saved to /var/cache/conftool/dbconfig/20220603-113739-ladsgroup.json [11:37:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:48:08] (03CR) 10Alexandros Kosiaris: [C: 03+2] mediawiki 0.2.1: Add a helm test [deployment-charts] - 10https://gerrit.wikimedia.org/r/800118 (owner: 10Ahmon Dancy) [11:49:10] (03PS1) 10Muehlenhoff: ipmi: Assign SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/802757 (https://phabricator.wikimedia.org/T308013) [11:49:12] (03PS1) 10Muehlenhoff: webperf: Assign SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/802758 (https://phabricator.wikimedia.org/T308013) [11:51:24] (03Merged) 10jenkins-bot: mediawiki 0.2.1: Add a helm test [deployment-charts] - 10https://gerrit.wikimedia.org/r/800118 (owner: 10Ahmon Dancy) [11:52:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121', diff saved to https://phabricator.wikimedia.org/P29368 and previous config saved to /var/cache/conftool/dbconfig/20220603-115244-ladsgroup.json [11:52:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:15] (03PS5) 10KartikMistry: Update cxserver to 2022-05-31-123738-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/801663 (https://phabricator.wikimedia.org/T306963) [11:55:37] (03CR) 10KartikMistry: Update cxserver to 2022-05-31-123738-production (033 comments) [deployment-charts] - 
10https://gerrit.wikimedia.org/r/801663 (https://phabricator.wikimedia.org/T306963) (owner: 10KartikMistry) [11:57:10] (03CR) 10CI reject: [V: 04-1] Update cxserver to 2022-05-31-123738-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/801663 (https://phabricator.wikimedia.org/T306963) (owner: 10KartikMistry) [12:01:55] (03PS6) 10KartikMistry: Update cxserver to 2022-05-31-123738-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/801663 (https://phabricator.wikimedia.org/T306963) [12:05:22] (03PS1) 10Muehlenhoff: Add lgaulia to LDAP users [puppet] - 10https://gerrit.wikimedia.org/r/802761 (https://phabricator.wikimedia.org/T309844) [12:06:38] PROBLEM - Check systemd state on ms-be1064 is CRITICAL: CRITICAL - degraded: The following units failed: swift-drive-audit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:06:58] RECOVERY - SSH on druid1006.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:07:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121 (T298560)', diff saved to https://phabricator.wikimedia.org/P29369 and previous config saved to /var/cache/conftool/dbconfig/20220603-120750-ladsgroup.json [12:07:52] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1141.eqiad.wmnet with reason: Maintenance [12:07:53] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1141.eqiad.wmnet with reason: Maintenance [12:07:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:55] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560 [12:07:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1141 (T298560)', diff saved 
to https://phabricator.wikimedia.org/P29370 and previous config saved to /var/cache/conftool/dbconfig/20220603-120758-ladsgroup.json [12:07:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:28] (03CR) 10Muehlenhoff: [C: 03+2] Add lgaulia to LDAP users [puppet] - 10https://gerrit.wikimedia.org/r/802761 (https://phabricator.wikimedia.org/T309844) (owner: 10Muehlenhoff) [12:13:13] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to wmf for Lgaulia - https://phabricator.wikimedia.org/T309844 (10MoritzMuehlenhoff) 05Open→03Resolved p:05Triage→03Medium a:03MoritzMuehlenhoff @larissagaulia Your LDAP access to the wmf group has been enabled. Please reopen the task... [12:25:34] (03CR) 10Alexandros Kosiaris: [C: 04-1] mwdebug service: Add traindev environment support (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/798883 (owner: 10Ahmon Dancy) [12:26:16] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [12:26:47] (03CR) 10Cathal Mooney: "Thanks for the updates, I'll revise and update." 
[puppet] - 10https://gerrit.wikimedia.org/r/802499 (https://phabricator.wikimedia.org/T304989) (owner: 10Cathal Mooney) [12:32:12] (03PS2) 10Cathal Mooney: Add cloudsw1-e4 and cloudsw1-f4 to mgmt and adjust existing cloudsw [puppet] - 10https://gerrit.wikimedia.org/r/802499 (https://phabricator.wikimedia.org/T304989) [12:55:17] (03PS1) 10BryanDavis: Revert "proxy: horrible hack for T309821" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/802620 (https://phabricator.wikimedia.org/T309821) [12:56:06] (03PS1) 10Itamar Givon: Turn Wikbase termbox SSR off for beta wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/802770 (https://phabricator.wikimedia.org/T304328) [12:56:56] (03CR) 10Majavah: [C: 03+1] Revert "proxy: horrible hack for T309821" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/802620 (https://phabricator.wikimedia.org/T309821) (owner: 10BryanDavis) [12:57:32] (03CR) 10BryanDavis: [C: 03+2] Revert "proxy: horrible hack for T309821" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/802620 (https://phabricator.wikimedia.org/T309821) (owner: 10BryanDavis) [12:57:59] (03CR) 10Alexandros Kosiaris: [C: 03+1] Update cxserver to 2022-05-31-123738-production (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/801663 (https://phabricator.wikimedia.org/T306963) (owner: 10KartikMistry) [12:58:36] (03Merged) 10jenkins-bot: Revert "proxy: horrible hack for T309821" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/802620 (https://phabricator.wikimedia.org/T309821) (owner: 10BryanDavis) [13:03:00] RECOVERY - Check systemd state on ms-be1064 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:05:55] (03PS1) 10Jforrester: extdist: Drop 1.36, now EOL [mediawiki-config] - 10https://gerrit.wikimedia.org/r/802772 (https://phabricator.wikimedia.org/T309864) [13:09:01] (03PS1) 10BryanDavis: d/changelog: Prepare for 0.86 
release [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/802773 (https://phabricator.wikimedia.org/T309821) [13:09:41] (03PS1) 10Jaime Nuche: scap: boostrap freshly provisioned scap targets [puppet] - 10https://gerrit.wikimedia.org/r/802775 (https://phabricator.wikimedia.org/T309713) [13:10:21] (03CR) 10Jaime Nuche: [C: 04-1] "Batch operation has not been merged yet" [puppet] - 10https://gerrit.wikimedia.org/r/802775 (https://phabricator.wikimedia.org/T309713) (owner: 10Jaime Nuche) [13:10:43] (03CR) 10CI reject: [V: 04-1] scap: boostrap freshly provisioned scap targets [puppet] - 10https://gerrit.wikimedia.org/r/802775 (https://phabricator.wikimedia.org/T309713) (owner: 10Jaime Nuche) [13:11:41] (03CR) 10BryanDavis: [C: 03+2] d/changelog: Prepare for 0.86 release [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/802773 (https://phabricator.wikimedia.org/T309821) (owner: 10BryanDavis) [13:12:52] (03Merged) 10jenkins-bot: d/changelog: Prepare for 0.86 release [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/802773 (https://phabricator.wikimedia.org/T309821) (owner: 10BryanDavis) [13:17:38] (03PS2) 10Jaime Nuche: scap: boostrap freshly provisioned scap targets [puppet] - 10https://gerrit.wikimedia.org/r/802775 (https://phabricator.wikimedia.org/T309713) [13:18:11] (03CR) 10Jaime Nuche: [C: 04-1] "Batch operation has not been merged yet" [puppet] - 10https://gerrit.wikimedia.org/r/802775 (https://phabricator.wikimedia.org/T309713) (owner: 10Jaime Nuche) [13:22:19] 10SRE, 10ops-eqiad: db1128 faulty memory - https://phabricator.wikimedia.org/T309291 (10Cmjohnson) 05Open→03Resolved replaced the DIMM and updated BIOS [13:24:47] (03CR) 10Lucas Werkmeister (WMDE): Turn Wikbase termbox SSR off for beta wikidata (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/802770 (https://phabricator.wikimedia.org/T304328) (owner: 10Itamar Givon) [13:25:35] (03CR) 10Jakob: [C: 04-1] "Ugh, I think there is some confusing naming 
going on here and this doesn't quite do what we want. https://gerrit.wikimedia.org/g/operation" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/802770 (https://phabricator.wikimedia.org/T304328) (owner: 10Itamar Givon) [13:28:57] (03PS7) 10KartikMistry: Update cxserver to 2022-05-31-123738-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/801663 (https://phabricator.wikimedia.org/T306963) [13:29:19] (03CR) 10Itamar Givon: Turn Wikbase termbox SSR off for beta wikidata (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/802770 (https://phabricator.wikimedia.org/T304328) (owner: 10Itamar Givon) [13:30:03] (03CR) 10KartikMistry: Update cxserver to 2022-05-31-123738-production (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/801663 (https://phabricator.wikimedia.org/T306963) (owner: 10KartikMistry) [13:40:39] (03PS3) 10Jaime Nuche: scap: boostrap freshly provisioned scap targets [puppet] - 10https://gerrit.wikimedia.org/r/802775 (https://phabricator.wikimedia.org/T309713) [13:42:44] (03PS1) 10Ladsgroup: switchover-tmpl: Add commands for the heartbeat and zarcillo [software] - 10https://gerrit.wikimedia.org/r/802778 [13:43:38] (03CR) 10Jakob: [C: 04-1] Turn Wikbase termbox SSR off for beta wikidata (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/802770 (https://phabricator.wikimedia.org/T304328) (owner: 10Itamar Givon) [13:44:22] jouncebot: nowandnext [13:44:23] For the next 17 hour(s) and 15 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220603T0700) [13:44:23] In 17 hour(s) and 15 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220604T0700) [13:45:14] (03CR) 10Ladsgroup: switchover-tmpl: Add commands for the heartbeat and zarcillo (031 comment) [software] - 10https://gerrit.wikimedia.org/r/802778 (owner: 10Ladsgroup) [13:45:52] PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test [13:51:28] (03CR) 10Jaime Nuche: [C: 04-1] scap: boostrap freshly provisioned scap targets [puppet] - 10https://gerrit.wikimedia.org/r/802775 (https://phabricator.wikimedia.org/T309713) (owner: 10Jaime Nuche) [13:54:44] PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test [13:55:36] (03PS2) 10Eevans: WIP: Configure AQS Cassandra hosts [puppet] - 10https://gerrit.wikimedia.org/r/802604 (https://phabricator.wikimedia.org/T307801) [13:56:14] (03PS3) 10Eevans: WIP: Configure AQS Cassandra hosts [puppet] - 10https://gerrit.wikimedia.org/r/802604 (https://phabricator.wikimedia.org/T307801) [13:57:51] (03CR) 10Jakob: [C: 04-1] Turn Wikbase termbox SSR off for beta wikidata (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/802770 (https://phabricator.wikimedia.org/T304328) (owner: 10Itamar Givon) [14:03:30] PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test [14:08:06] RECOVERY - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is OK: 2 instances in the admin-monitoring project 
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test [14:10:30] PROBLEM - SSH on cp5012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:14:42] !log patching and restarting a few eqiad elastic hosts T309868 [14:14:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:47] T309868: Test openjdk 8 package upgrades on eqiad stretch hosts - https://phabricator.wikimedia.org/T309868 [14:25:37] !log herron@cumin1001 START - Cookbook sre.kafka.roll-restart-brokers for Kafka A:kafka-main-eqiad cluster: Roll restart of jvm daemons. [14:25:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:26] (03CR) 10Mabualruz: Remove 6 deprecated ResourceLoader skin modules in core (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/802578 (https://phabricator.wikimedia.org/T304322) (owner: 10Mabualruz) [14:48:50] PROBLEM - SSH on wtp1048.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:55:30] (03CR) 10Lucas Werkmeister (WMDE): "Sorry, I didn’t check what this variable actually controls earlier." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/802770 (https://phabricator.wikimedia.org/T304328) (owner: 10Itamar Givon) [14:56:36] (03PS1) 10Jcrespo: MySQLMedia: Add unit testing and small refactorings [software/mediabackups] - 10https://gerrit.wikimedia.org/r/802787 [14:58:20] !log jhathaway@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on mx1001.wikimedia.org with reason: BDAT [14:58:22] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on mx1001.wikimedia.org with reason: BDAT [14:58:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:06] RECOVERY - Check systemd state on gitlab1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:10:56] PROBLEM - SSH on druid1006.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:15:27] (03PS1) 10Jelto: Revert "gitlab: reduce backup_keep_time to 2d" [puppet] - 10https://gerrit.wikimedia.org/r/802622 [15:17:56] RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:20:09] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35719/console" [puppet] - 10https://gerrit.wikimedia.org/r/802622 (owner: 10Jelto) [15:20:12] (03PS2) 10Majavah: wmcs: vps: create_instance_with_prefix: unbreak [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/802170 [15:21:17] (03CR) 10Andrew Bogott: [C: 03+2] wmcs: vps: create_instance_with_prefix: unbreak [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/802170 (owner: 10Majavah) [15:22:04] (03CR) 10Andrew Bogott: wmcs: vps: create_instance_with_prefix: unbreak [cookbooks] 
(wmcs) - 10https://gerrit.wikimedia.org/r/802170 (owner: 10Majavah) [15:23:58] (03PS3) 10Majavah: wmcs: vps: create_instance_with_prefix: unbreak [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/802170 [15:24:41] (03CR) 10Majavah: wmcs: vps: create_instance_with_prefix: unbreak (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/802170 (owner: 10Majavah) [15:26:57] (03CR) 10Andrew Bogott: wmcs: vps: create_instance_with_prefix: unbreak (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/802170 (owner: 10Majavah) [15:30:06] (03CR) 10Majavah: wmcs: vps: create_instance_with_prefix: unbreak (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/802170 (owner: 10Majavah) [15:31:40] (03PS4) 10Ahmon Dancy: mwdebug service: Add traindev environment support [deployment-charts] - 10https://gerrit.wikimedia.org/r/798883 [15:32:36] (03PS4) 10Majavah: wmcs: vps: create_instance_with_prefix: unbreak [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/802170 [15:34:57] (03PS5) 10Ahmon Dancy: mwdebug service: Add traindev environment support [deployment-charts] - 10https://gerrit.wikimedia.org/r/798883 (https://phabricator.wikimedia.org/T299648) [15:35:27] (03CR) 10Ahmon Dancy: mwdebug service: Add traindev environment support (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/798883 (https://phabricator.wikimedia.org/T299648) (owner: 10Ahmon Dancy) [15:40:48] jounebot nowandnext [15:40:51] jouncebot nowandnext [15:40:51] For the next 15 hour(s) and 19 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220603T0700) [15:40:51] In 15 hour(s) and 19 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220604T0700) [15:43:39] (03CR) 10Ahmon Dancy: [C: 03+1] docker_registry_ha: Authorize GitLab trusted runners using JWT (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/793875 (https://phabricator.wikimedia.org/T308501) (owner: 10Dduvall) [15:43:57] (03PS9) 10Jelto: GitLab: enable container registry [puppet] - 10https://gerrit.wikimedia.org/r/790778 (https://phabricator.wikimedia.org/T307537) (owner: 10Brennen Bearnes) [15:49:13] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35720/console" [puppet] - 10https://gerrit.wikimedia.org/r/790778 (https://phabricator.wikimedia.org/T307537) (owner: 10Brennen Bearnes) [15:50:06] RECOVERY - SSH on wtp1048.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:50:32] (03CR) 10Ahmon Dancy: [C: 03+1] Provide buildkitd to GitLab runners [puppet] - 10https://gerrit.wikimedia.org/r/791655 (https://phabricator.wikimedia.org/T308271) (owner: 10Dduvall) [15:56:58] (03CR) 10Jelto: [V: 03+1 C: 03+1] "Technically that looks fine for me now and the new host have a dedicated volume for the registry data." [puppet] - 10https://gerrit.wikimedia.org/r/790778 (https://phabricator.wikimedia.org/T307537) (owner: 10Brennen Bearnes) [16:03:33] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Assign SPDX headers to puppet.git - https://phabricator.wikimedia.org/T308013 (10MaxSem) [16:06:30] !log herron@cumin1001 END (PASS) - Cookbook sre.kafka.roll-restart-brokers (exit_code=0) for Kafka A:kafka-main-eqiad cluster: Roll restart of jvm daemons. 
[16:06:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:36] (03PS5) 10Andrew Bogott: wmcs: vps: create_instance_with_prefix: unbreak [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/802170 (owner: 10Majavah) [16:11:08] (03CR) 10Andrew Bogott: "updated with the version of this patch that seems to actually run" [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/802170 (owner: 10Majavah) [16:11:45] !log bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: restart to enable S3 plugin - bking@cumin1001 - T309720 [16:11:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:49] T309720: Deploy S3 plugin on all Search team-managed Elastic hosts - https://phabricator.wikimedia.org/T309720 [16:12:06] RECOVERY - SSH on druid1006.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:12:18] !log dancy@deploy1002 sync-wikiversions aborted: testing mediawiki container image build and deploy (duration: 00m 11s) [16:12:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:08] PROBLEM - Disk space on mx1001 is CRITICAL: DISK CRITICAL - /var/spool/exim4/db is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mx1001&var-datasource=eqiad+prometheus/ops [16:13:24] !log dancy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [16:13:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:53] !log jhathaway@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on mx1001.wikimedia.org with reason: BDAT [16:15:54] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on mx1001.wikimedia.org with reason: BDAT 
[16:15:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:29] !log dancy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [16:17:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:14] PROBLEM - Disk space on centrallog2002 is CRITICAL: DISK CRITICAL - free space: /srv 54885 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=centrallog2002&var-datasource=codfw+prometheus/ops [16:19:31] !log dancy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [16:19:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:08] !log dancy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [16:20:09] !log dancy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [16:20:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:25] !log dancy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [16:20:26] !log dancy@deploy1002 sync-wikiversions aborted: testing mediawiki container image build and deploy (duration: 07m 07s) [16:20:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:00] PROBLEM - Check systemd state on doc1001 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-doc-doc2001.codfw.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:26:16] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - 
https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [16:26:35] (03PS1) 10Ahmon Dancy: scap.cfg.erb: Set release_repo_update_mediawiki_releases_values_cmd [puppet] - 10https://gerrit.wikimedia.org/r/802795 (https://phabricator.wikimedia.org/T299648) [16:46:02] (03CR) 10SBassett: [C: 03+1] "Thinking about this a bit more, I'm really not seeing this being more than a low risk, in what the proposed CSP policy opens up. Again, t" [puppet] - 10https://gerrit.wikimedia.org/r/801776 (https://phabricator.wikimedia.org/T285570) (owner: 10Catrope) [16:52:34] (03PS1) 10Ahmon Dancy: mediawiki 0.2.2: Run test job as uid 1000 [deployment-charts] - 10https://gerrit.wikimedia.org/r/802799 [16:55:34] RECOVERY - Disk space on mx1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mx1001&var-datasource=eqiad+prometheus/ops [16:56:55] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Assign SPDX headers to puppet.git - https://phabricator.wikimedia.org/T308013 (10Nemo_bis) [17:09:53] (03PS2) 10Zabe: raid: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/802564 (https://phabricator.wikimedia.org/T308013) [17:11:21] (03CR) 10Zabe: raid: Add SPDX headers (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/802564 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [17:12:16] (03CR) 10Ebernhardson: [C: 03+1] [cirrus] Fix typo in config var [mediawiki-config] - 10https://gerrit.wikimedia.org/r/801792 (owner: 10DCausse) [17:14:10] RECOVERY - SSH on cp5012.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:14:16] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [17:14:18] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:52] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [17:14:54] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [17:14:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:18:48] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [17:18:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:18:50] RECOVERY - Check systemd state on doc1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:33:00] (03PS14) 10JMeybohm: Add a cookbook for rolling reboot of k8s clusters [cookbooks] - 10https://gerrit.wikimedia.org/r/789680 (https://phabricator.wikimedia.org/T260661) [17:33:02] (03PS1) 10JMeybohm: black format cookbooks/sre/__init__.py [cookbooks] - 10https://gerrit.wikimedia.org/r/802810 [17:33:04] (03PS1) 10JMeybohm: Make SREBatchBase operate on host groups [cookbooks] - 10https://gerrit.wikimedia.org/r/802811 [17:36:19] (03CR) 10CI reject: [V: 04-1] Make SREBatchBase operate on host groups [cookbooks] - 10https://gerrit.wikimedia.org/r/802811 (owner: 10JMeybohm) [17:52:52] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:53:34] (03PS2) 10JMeybohm: Make SREBatchBase operate on host groups [cookbooks] - 10https://gerrit.wikimedia.org/r/802811 [17:54:16] (03CR) 10Herron: "Thanks for this!" 
[grafana-grizzly] - 10https://gerrit.wikimedia.org/r/802646 (https://phabricator.wikimedia.org/T302842) (owner: 10RLazarus) [17:55:38] PROBLEM - Number of backend failures per minute from CirrusSearch on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [600.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&viewPanel=9 [17:56:28] (03CR) 10CI reject: [V: 04-1] Make SREBatchBase operate on host groups [cookbooks] - 10https://gerrit.wikimedia.org/r/802811 (owner: 10JMeybohm) [17:58:54] (03CR) 10JMeybohm: Make SREBatchBase operate on host groups (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/802811 (owner: 10JMeybohm) [18:00:12] PROBLEM - Number of backend failures per minute from CirrusSearch on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [600.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&viewPanel=9 [18:01:34] (03PS3) 10JMeybohm: Make SREBatchBase operate on host groups [cookbooks] - 10https://gerrit.wikimedia.org/r/802811 [18:01:49] !log aokoth@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2019.codfw.wmnet [18:01:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:02:28] herron: ahahaha that sure didn't work as intended, did it, thanks for the catch [18:05:35] (as much as the 1000.00% availability for etcd looks great IMO) [18:07:53] (03CR) 10Dzahn: [C: 03+2] vrts: adjust tests files to renamed role class [puppet] - 10https://gerrit.wikimedia.org/r/802580 (https://phabricator.wikimedia.org/T293942) (owner: 10Dzahn) [18:07:59] (03PS2) 10Dzahn: vrts: adjust tests files to renamed role class [puppet] - 10https://gerrit.wikimedia.org/r/802580 
(https://phabricator.wikimedia.org/T293942) [18:08:50] rzl ha! tbh it’s great to review these though, I think theres a lot we could standardize/simplify too [18:09:23] yeah totally [18:09:34] !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2019.codfw.wmnet [18:09:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:12] (03PS2) 10Dzahn: Revert "gitlab: reduce backup_keep_time to 2d" [puppet] - 10https://gerrit.wikimedia.org/r/802622 (owner: 10Jelto) [18:11:56] ahhh I see what happened, all that time staring at those parentheses and I still ended up with them in the wrong place -- I'll turn that around later [18:12:29] (03CR) 10Dzahn: [C: 03+2] Revert "gitlab: reduce backup_keep_time to 2d" [puppet] - 10https://gerrit.wikimedia.org/r/802622 (owner: 10Jelto) [18:18:32] (03CR) 10Dzahn: [C: 03+2] "fwiw, this triggered a Exec[Reconfigure GitLab] and that takes gitlab down for a moment.. it did come back though right when I started to " [puppet] - 10https://gerrit.wikimedia.org/r/802622 (owner: 10Jelto) [18:19:49] (03PS4) 10Eevans: WIP: Configure AQS Cassandra hosts [puppet] - 10https://gerrit.wikimedia.org/r/802604 (https://phabricator.wikimedia.org/T307801) [18:21:54] PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:29:42] (03CR) 10Dzahn: [C: 03+2] Revert "scap.cfg: Enable rsync_cdbs in beta" [puppet] - 10https://gerrit.wikimedia.org/r/801746 (https://phabricator.wikimedia.org/T297326) (owner: 10Ahmon Dancy) [18:29:47] (03PS3) 10Dzahn: Revert "scap.cfg: Enable rsync_cdbs in beta" [puppet] - 10https://gerrit.wikimedia.org/r/801746 (https://phabricator.wikimedia.org/T297326) (owner: 10Ahmon Dancy) [18:34:32] !log deleting expired digicert TLS certs https://gerrit.wikimedia.org/r/c/operations/puppet/+/791678 [18:34:35] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:50] 10ops-eqiad: Port with no description on access switch - https://phabricator.wikimedia.org/T309741 (10phaultfinder) [18:37:51] (03PS5) 10Dzahn: parsoid::testing: move apache/php auto_restarts to separate profile [puppet] - 10https://gerrit.wikimedia.org/r/800245 [18:40:59] (03CR) 10RLazarus: httpbb: Add basic tests for query_service (WDQS) (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/802079 (owner: 10Lucas Werkmeister (WMDE)) [18:42:53] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/pcc-worker1002/35721/testreduce1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/800245 (owner: 10Dzahn) [18:45:50] !log testreduce1001 - systemctl reset-failed after gerrit:800245 removed failed auto_restart services for non-existing apache and php services [18:45:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:01] !log testreduce - re-enabling Icinga notifications that were disabled for unknown reasons [18:47:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:40] !log jhathaway@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on mx1001.wikimedia.org with reason: BDAT [18:51:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:42] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on mx1001.wikimedia.org with reason: BDAT [18:51:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:37] 10SRE: cloudstore1008 - eno2 reporting no carrier - https://phabricator.wikimedia.org/T309885 (10Dzahn) [18:54:02] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:54:59] 10SRE, 10Data-Persistence-Backup, 10media-backups, 10Goal, 10Patch-For-Review: Document media recovery use case proposals and 
decide their priority - https://phabricator.wikimedia.org/T299764 (10Krinkle) >>! In T299764#7813469, @jcrespo wrote: > The main issue I ran into is that it was said it was guaran... [18:55:58] 10SRE: an-tool1005 - memcached Connection refused - https://phabricator.wikimedia.org/T309886 (10Dzahn) [18:58:17] (03CR) 10Dzahn: "are these files still relevant?" [puppet] - 10https://gerrit.wikimedia.org/r/802579 (https://phabricator.wikimedia.org/T293942) (owner: 10Dzahn) [19:04:52] (03CR) 10Herron: Add role::netmon to the netmon1003 instance. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/802593 (https://phabricator.wikimedia.org/T309074) (owner: 10Andrea Denisse) [19:07:51] (03PS1) 10Dzahn: Bug: T307142 Change-Id: I13b8b25f66cf7c384ce464bbbeb9b7a3a7dc3861 [puppet] - 10https://gerrit.wikimedia.org/r/802821 (https://phabricator.wikimedia.org/T307142) [19:08:53] (03CR) 10Dzahn: "check the mysql privileges with DBA. would be good to know whether the new instance can or can not write to the same DB" [puppet] - 10https://gerrit.wikimedia.org/r/802593 (https://phabricator.wikimedia.org/T309074) (owner: 10Andrea Denisse) [19:11:08] (03CR) 10CI reject: [V: 04-1] Bug: T307142 Change-Id: I13b8b25f66cf7c384ce464bbbeb9b7a3a7dc3861 [puppet] - 10https://gerrit.wikimedia.org/r/802821 (https://phabricator.wikimedia.org/T307142) (owner: 10Dzahn) [19:11:15] (03PS2) 10Dzahn: logtash: replace gitlab1001 with gitlab1004 in tests [puppet] - 10https://gerrit.wikimedia.org/r/802821 (https://phabricator.wikimedia.org/T307142) [19:14:25] (03PS1) 10Dzahn: gitlab/acme_chief: remove gitlab1001 from list of (passive) hosts [puppet] - 10https://gerrit.wikimedia.org/r/802822 (https://phabricator.wikimedia.org/T307142) [19:16:03] (03PS1) 10Dzahn: gitlab::dump: delete class [puppet] - 10https://gerrit.wikimedia.org/r/802823 (https://phabricator.wikimedia.org/T274463) [19:16:23] (03PS51) 10Raymond Ndibe: Create REST api service to manage toolforge replica.my.cnf [puppet] - 
10https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) [19:16:44] (03PS2) 10Dzahn: gitlab::dump: delete role and profile classes [puppet] - 10https://gerrit.wikimedia.org/r/802823 (https://phabricator.wikimedia.org/T274463) [19:16:48] (03CR) 10Herron: [C: 03+1] wmcs: Added taskircmail, ircmail and pagetaskircmail routings [puppet] - 10https://gerrit.wikimedia.org/r/802040 (owner: 10David Caro) [19:17:42] (03PS1) 10Dzahn: DHCP: remove gitlab1001 [puppet] - 10https://gerrit.wikimedia.org/r/802824 (https://phabricator.wikimedia.org/T307142) [19:19:53] (03PS1) 10Dzahn: site: remove gitlab1001, adjust gitlab machine descriptions [puppet] - 10https://gerrit.wikimedia.org/r/802846 (https://phabricator.wikimedia.org/T307142) [19:20:27] (03CR) 10Herron: Add role::netmon to the netmon1003 instance. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/802593 (https://phabricator.wikimedia.org/T309074) (owner: 10Andrea Denisse) [19:20:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118 (T298560)', diff saved to https://phabricator.wikimedia.org/P29379 and previous config saved to /var/cache/conftool/dbconfig/20220603-192042-ladsgroup.json [19:20:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:47] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560 [19:22:27] (03PS1) 10Dzahn: site: remove gitlab_dump role from gitlab2002 [puppet] - 10https://gerrit.wikimedia.org/r/802847 (https://phabricator.wikimedia.org/T274463) [19:23:03] (03PS2) 10Dzahn: site: remove gitlab_dump role from gitlab2002 [puppet] - 10https://gerrit.wikimedia.org/r/802847 (https://phabricator.wikimedia.org/T274463) [19:23:08] RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:24:05] (03CR) 10Dzahn: [C: 03+2] site: remove gitlab_dump 
role from gitlab2002 [puppet] - 10https://gerrit.wikimedia.org/r/802847 (https://phabricator.wikimedia.org/T274463) (owner: 10Dzahn) [19:26:37] (03CR) 10Dzahn: [C: 03+2] "Filebucketed /etc/ferm/conf.d/10_bacula-file-daemon-backup1001.eqiad.wmnet to puppet ..etc we had backup::host on this but not a fileset." [puppet] - 10https://gerrit.wikimedia.org/r/802847 (https://phabricator.wikimedia.org/T274463) (owner: 10Dzahn) [19:29:37] !log gitlab2002 - stop rsync service, apt-get remove --purge rsync, delete /etc/rsync.d/ and /etc/rsyncd.conf - after gerrit:802847 T274463 [19:29:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:29:42] T274463: Backups for GitLab - https://phabricator.wikimedia.org/T274463 [19:30:39] (03CR) 10Dzahn: [C: 03+1] "not used after https://gerrit.wikimedia.org/r/c/operations/puppet/+/802847" [puppet] - 10https://gerrit.wikimedia.org/r/802823 (https://phabricator.wikimedia.org/T274463) (owner: 10Dzahn) [19:34:19] (03PS1) 10Dzahn: vrts: rename daemon resource and template from otrs to vrts [puppet] - 10https://gerrit.wikimedia.org/r/802849 (https://phabricator.wikimedia.org/T293942) [19:35:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118', diff saved to https://phabricator.wikimedia.org/P29380 and previous config saved to /var/cache/conftool/dbconfig/20220603-193547-ladsgroup.json [19:35:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:36:30] (03PS1) 10Dzahn: vrts: rename systemd timer to train spamassassin [puppet] - 10https://gerrit.wikimedia.org/r/802850 (https://phabricator.wikimedia.org/T293942) [19:37:17] (03CR) 10Cwhite: "These files are test fixtures that have no effect on production. This change by itself is safe." 
[puppet] - 10https://gerrit.wikimedia.org/r/802821 (https://phabricator.wikimedia.org/T307142) (owner: 10Dzahn) [19:39:23] (03PS1) 10Dzahn: vrts: rename exim4 templates from otrs to vrts [puppet] - 10https://gerrit.wikimedia.org/r/802851 (https://phabricator.wikimedia.org/T293942) [19:40:06] (03CR) 10Dzahn: "makes sense. thank you!. I would like to merge it anyways just to clean up occurences of the machine name then." [puppet] - 10https://gerrit.wikimedia.org/r/802821 (https://phabricator.wikimedia.org/T307142) (owner: 10Dzahn) [19:44:47] (03PS1) 10Dzahn: vrts: rename TicketExport2Mbox file [puppet] - 10https://gerrit.wikimedia.org/r/802852 (https://phabricator.wikimedia.org/T293942) [19:47:29] (03PS1) 10Dzahn: vrts: rename ferm services from otrs to vrts [puppet] - 10https://gerrit.wikimedia.org/r/802853 (https://phabricator.wikimedia.org/T293942) [19:50:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118', diff saved to https://phabricator.wikimedia.org/P29381 and previous config saved to /var/cache/conftool/dbconfig/20220603-195052-ladsgroup.json [19:50:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:51:08] (03PS1) 10Dzahn: mx: rename OTRS database related variables [puppet] - 10https://gerrit.wikimedia.org/r/802854 (https://phabricator.wikimedia.org/T293942) [19:54:16] (03PS2) 10Dzahn: mx: rename OTRS database related variables [puppet] - 10https://gerrit.wikimedia.org/r/802854 (https://phabricator.wikimedia.org/T293942) [19:54:30] (03CR) 10Dzahn: [C: 03+2] logtash: replace gitlab1001 with gitlab1004 in tests [puppet] - 10https://gerrit.wikimedia.org/r/802821 (https://phabricator.wikimedia.org/T307142) (owner: 10Dzahn) [19:55:09] (03PS10) 10Brennen Bearnes: GitLab: enable container registry [puppet] - 10https://gerrit.wikimedia.org/r/790778 (https://phabricator.wikimedia.org/T307537) [20:05:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118 (T298560)', 
diff saved to https://phabricator.wikimedia.org/P29382 and previous config saved to /var/cache/conftool/dbconfig/20220603-200557-ladsgroup.json [20:06:00] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1134.eqiad.wmnet with reason: Maintenance [20:06:01] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1134.eqiad.wmnet with reason: Maintenance [20:06:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:06:03] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560 [20:06:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:06:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1134 (T298560)', diff saved to https://phabricator.wikimedia.org/P29383 and previous config saved to /var/cache/conftool/dbconfig/20220603-200606-ladsgroup.json [20:06:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:06:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:20:00] (03CR) 10Jdlrobson: [C: 04-1] Remove 6 deprecated ResourceLoader skin modules in core (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/802578 (https://phabricator.wikimedia.org/T304322) (owner: 10Mabualruz) [20:20:42] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Assign SPDX headers to puppet.git - https://phabricator.wikimedia.org/T308013 (10Dzahn) > bundle exec rake 'spdx:convert:module[MODULENAME]' Is there any way to install the ruby gem "puppet" from a Debian package? `Could not find gem 'puppet (= 5.5.1... 
[20:26:16] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [21:10:16] (03PS1) 10Cwhite: opensearch: add support for managing opensearch 2.0 [puppet] - 10https://gerrit.wikimedia.org/r/802862 (https://phabricator.wikimedia.org/T304440) [21:10:18] (03PS1) 10Cwhite: beta-logs: change opensearch version to 2.0 [puppet] - 10https://gerrit.wikimedia.org/r/802863 (https://phabricator.wikimedia.org/T304440) [21:34:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141 (T298560)', diff saved to https://phabricator.wikimedia.org/P29384 and previous config saved to /var/cache/conftool/dbconfig/20220603-213423-ladsgroup.json [21:34:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:34:28] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560 [21:36:31] !log bking@cumin1001 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: restart to enable S3 plugin - bking@cumin1001 - T309720 [21:36:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:36:35] T309720: Deploy S3 plugin on all Search team-managed Elastic hosts - https://phabricator.wikimedia.org/T309720 [21:41:39] (03PS1) 10Brennen Bearnes: tag-release.sh: add some logging, more rigorous tag push [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/802868 [21:42:12] (03PS1) 10Brennen Bearnes: tag-release.sh: add some logging, more rigorous tag push [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/802869 [21:43:46] (03Abandoned) 10Brennen 
Bearnes: tag-release.sh: add some logging, more rigorous tag push [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/802868 (owner: 10Brennen Bearnes) [21:49:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P29385 and previous config saved to /var/cache/conftool/dbconfig/20220603-214928-ladsgroup.json [21:49:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:04:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P29386 and previous config saved to /var/cache/conftool/dbconfig/20220603-220433-ladsgroup.json [22:04:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:19:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141 (T298560)', diff saved to https://phabricator.wikimedia.org/P29387 and previous config saved to /var/cache/conftool/dbconfig/20220603-221938-ladsgroup.json [22:19:40] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [22:19:42] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [22:19:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:19:45] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560 [22:19:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:19:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:07:48] (03PS4) 10RLazarus: slo: Correct queries for error budget remaining [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/802646 (https://phabricator.wikimedia.org/T302842) [23:20:33] (03CR) 10RLazarus: slo: Correct 
queries for error budget remaining (034 comments) [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/802646 (https://phabricator.wikimedia.org/T302842) (owner: 10RLazarus) [23:24:21] (03PS1) 10Cwhite: add new index pattern format [software/ecs] - 10https://gerrit.wikimedia.org/r/802873 (https://phabricator.wikimedia.org/T305175) [23:53:50] (03PS1) 10Dzahn: vrts: delete idle_agent_report [puppet] - 10https://gerrit.wikimedia.org/r/802876 (https://phabricator.wikimedia.org/T293942) [23:54:44] (03CR) 10CI reject: [V: 04-1] vrts: delete idle_agent_report [puppet] - 10https://gerrit.wikimedia.org/r/802876 (https://phabricator.wikimedia.org/T293942) (owner: 10Dzahn) [23:55:00] (03CR) 10Dzahn: "https://puppet-compiler.wmflabs.org/pcc-worker1002/35722/otrs1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/802849 (https://phabricator.wikimedia.org/T293942) (owner: 10Dzahn) [23:56:34] (03PS2) 10Dzahn: vrts: delete idle_agent_report [puppet] - 10https://gerrit.wikimedia.org/r/802876 (https://phabricator.wikimedia.org/T293942) [23:59:16] (03CR) 10Dzahn: "https://puppet-compiler.wmflabs.org/pcc-worker1003/35723/otrs1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/802850 (https://phabricator.wikimedia.org/T293942) (owner: 10Dzahn)