[00:15:17] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:34:39] PROBLEM - Host wdqs1013 is DOWN: PING CRITICAL - Packet loss = 100% [00:36:23] RECOVERY - Host wdqs1013 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms [00:38:06] !log dpifke@deploy1002 Started deploy [performance/navtiming@88f12a0]: Re-deploy fixed CpuBenchmark (T281243) [00:38:13] !log dpifke@deploy1002 Finished deploy [performance/navtiming@88f12a0]: Re-deploy fixed CpuBenchmark (T281243) (duration: 00m 06s) [00:38:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:38:17] T281243: Expose CPU benchmark data to Prometheus/Grafana - https://phabricator.wikimedia.org/T281243 [00:38:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:39:13] !log dpifke@deploy1002 Started deploy [performance/navtiming@88f12a0]: Revert CpuBenchmark again (T281243) [00:39:18] !log dpifke@deploy1002 Finished deploy [performance/navtiming@88f12a0]: Revert CpuBenchmark again (T281243) (duration: 00m 05s) [00:39:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:39:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:34:57] PROBLEM - MariaDB Replica Lag: s4 on db2097 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1146.62 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [02:03:19] !log rzl@cumin2001 conftool action : get/pooled; selector: service=docker-registry [02:03:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:06:25] 10SRE, 10Anti-Harassment, 10Traffic: Enable automatic redirection to the mobile version of votewiki - https://phabricator.wikimedia.org/T288938 (10Niharika) 05Open→03Resolved a:03Niharika Thank you indeed! 
[04:19:53] RECOVERY - MariaDB Replica Lag: s4 on db2097 is OK: OK slave_sql_lag Replication lag: 0.26 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [04:38:52] !log Drop user2 from s6 - T289051 [04:39:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:39:02] T289051: mysql.user2 table present on s1, s2, s6 - https://phabricator.wikimedia.org/T289051 [04:47:10] 10SRE, 10DC-Ops, 10Traffic, 10Sustainability (Incident Followup): Audit eqiad & codfw LVS network links - https://phabricator.wikimedia.org/T286881 (10Krinkle) [04:47:38] 10SRE, 10ops-codfw, 10DC-Ops, 10netops, 10Wikimedia-Incident: asw-a2-codfw unresponsive - https://phabricator.wikimedia.org/T286787 (10Krinkle) [04:47:48] 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic, 10Sustainability (Incident Followup): lvs2007, lvs2009, lvs2010 should not be on the same row A switch - https://phabricator.wikimedia.org/T286879 (10Krinkle) [04:47:54] 10SRE, 10Traffic, 10Sustainability (Incident Followup): LVS should handle losing a NIC on eqiad and codfw - https://phabricator.wikimedia.org/T286924 (10Krinkle) [04:48:51] 10SRE, 10Anti-Harassment, 10IP Info, 10serviceops: Update MaxMind GeoIP2 license key and product IDs for application servers - https://phabricator.wikimedia.org/T288844 (10Niharika) [04:49:23] 10SRE, 10ops-codfw, 10DC-Ops, 10netops, 10Wikimedia-Incident: asw-a2-codfw unresponsive - https://phabricator.wikimedia.org/T286787 (10Krinkle) [04:49:29] 10SRE, 10Traffic, 10Sustainability (Incident Followup): LVS should handle losing a NIC on eqiad and codfw - https://phabricator.wikimedia.org/T286924 (10Krinkle) [04:49:35] Krinkle: isn't https://wikitech.wikimedia.org/wiki/Incident_documentation/2021-07-16_codfw_network a duplicate of https://wikitech.wikimedia.org/wiki/Incident_documentation/2021-07-20_asw-a2-codfw_crash? 
[04:49:35] 10SRE, 10DC-Ops, 10Traffic, 10Sustainability (Incident Followup): Audit eqiad & codfw LVS network links - https://phabricator.wikimedia.org/T286881 (10Krinkle) [04:54:11] majavah: yeah, I just noticed it. It was on a different date [04:54:13] fixed [05:07:53] PROBLEM - Rate of JVM GC Old generation-s runs - cloudelastic1005-cloudelastic-chi-eqiad on cloudelastic1005 is CRITICAL: 103.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1005&panelId=37 [05:16:50] (03PS1) 10Marostegui: site.pp: Clarify task for the m5 master failover. [puppet] - 10https://gerrit.wikimedia.org/r/713572 [05:18:19] (03CR) 10Marostegui: [C: 03+2] site.pp: Clarify task for the m5 master failover. [puppet] - 10https://gerrit.wikimedia.org/r/713572 (owner: 10Marostegui) [05:25:09] RECOVERY - Rate of JVM GC Old generation-s runs - cloudelastic1005-cloudelastic-chi-eqiad on cloudelastic1005 is OK: (C)100 gt (W)80 gt 78.31 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1005&panelId=37 [06:37:27] 10SRE, 10SRE-Access-Requests: Requesting access to releasers-wikibase for @dang - https://phabricator.wikimedia.org/T288355 (10toan) >>! In T288355#7286336, @thcipriani wrote: >>>! In T288355#7286300, @RobH wrote: >> Ok, for the history of this group, I think we need the following approvals: >> >> [] - access... 
[06:49:35] PROBLEM - Stale file for node-exporter textfile in eqiad on alert1001 is CRITICAL: cluster=labsnfs file=node_directory_size_bytes.prom instance=labstore1004 job=node site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Stale_file_for_node-exporter_textfile https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile [07:30:46] (03PS1) 10H.krishna123: [WIP]: Add basic tox config Add Tox config into bernard repository along with the test-requirements.txt so that we can get CI up and running [software/bernard] - 10https://gerrit.wikimedia.org/r/713604 (https://phabricator.wikimedia.org/T284404) [07:32:24] (03CR) 10H.krishna123: "recheck" [software/bernard] - 10https://gerrit.wikimedia.org/r/713604 (https://phabricator.wikimedia.org/T284404) (owner: 10H.krishna123) [07:33:52] (03CR) 10H.krishna123: "recheck" [software/bernard] - 10https://gerrit.wikimedia.org/r/713604 (https://phabricator.wikimedia.org/T284404) (owner: 10H.krishna123) [07:36:28] (03CR) 10H.krishna123: "recheck" [software/bernard] - 10https://gerrit.wikimedia.org/r/713604 (https://phabricator.wikimedia.org/T284404) (owner: 10H.krishna123) [07:36:58] (03CR) 10H.krishna123: "recheck" [software/bernard] - 10https://gerrit.wikimedia.org/r/713604 (https://phabricator.wikimedia.org/T284404) (owner: 10H.krishna123) [07:38:27] (03PS2) 10H.krishna123: [WIP]: Add basic tox config Add Tox config into bernard repository along with the test-requirements.txt so that we can get CI up and running [software/bernard] - 10https://gerrit.wikimedia.org/r/713604 (https://phabricator.wikimedia.org/T284404) [07:38:52] (03PS3) 10H.krishna123: [WIP]: Add basic tox config [software/bernard] - 10https://gerrit.wikimedia.org/r/713604 (https://phabricator.wikimedia.org/T284404) [07:41:41] (03CR) 10H.krishna123: "recheck" [software/bernard] - 10https://gerrit.wikimedia.org/r/713604 (https://phabricator.wikimedia.org/T284404) (owner: 10H.krishna123) [07:43:42] (03PS3) 10Vgutierrez: varnish: Handle UDS traffic 
properly [puppet] - 10https://gerrit.wikimedia.org/r/713482 [08:00:40] (03CR) 10Jelto: "lgtm, however my helmfile experience is limited. So I will add Janis as a reviewer too." [deployment-charts] - 10https://gerrit.wikimedia.org/r/713441 (https://phabricator.wikimedia.org/T281538) (owner: 10Dzahn) [08:00:47] (03CR) 10Majavah: "recheck" [software/bernard] - 10https://gerrit.wikimedia.org/r/713604 (https://phabricator.wikimedia.org/T284404) (owner: 10H.krishna123) [08:02:10] (03CR) 10Majavah: "You need to get yourself added to the allowlist on https://gerrit.wikimedia.org/r/plugins/gitiles/integration/config/+/refs/heads/master/z" [software/bernard] - 10https://gerrit.wikimedia.org/r/713604 (https://phabricator.wikimedia.org/T284404) (owner: 10H.krishna123) [08:03:05] (03CR) 10Majavah: [WIP]: Add basic tox config (031 comment) [software/bernard] - 10https://gerrit.wikimedia.org/r/713604 (https://phabricator.wikimedia.org/T284404) (owner: 10H.krishna123) [08:05:40] (03CR) 10Marostegui: [WIP]: Add basic tox config (031 comment) [software/bernard] - 10https://gerrit.wikimedia.org/r/713604 (https://phabricator.wikimedia.org/T284404) (owner: 10H.krishna123) [08:12:41] PROBLEM - SSH on bast5001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:14:17] PROBLEM - cassandra CQL 10.64.48.154:9042 on maps1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886 [08:15:33] (03CR) 10JMeybohm: [C: 04-1] miscweb: add helmfile.yaml and values under services.d (034 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/713441 (https://phabricator.wikimedia.org/T281538) (owner: 10Dzahn) [08:15:38] 10SRE-swift-storage, 10Maps, 10serviceops, 10Patch-For-Review: Tegola staging doesn't connect to swift - https://phabricator.wikimedia.org/T289076 (10fgiunchedi) I can confirm that's indeed the Bullseye upgrade, good find @Jgiannelos ! 
[08:16:36] 10SRE-swift-storage, 10Maps, 10serviceops, 10Patch-For-Review: Tegola staging doesn't connect to swift - https://phabricator.wikimedia.org/T289076 (10fgiunchedi) cc @elukey as I know he'll be using v4 signatures too with thanos-swift [08:18:01] RECOVERY - cassandra CQL 10.64.48.154:9042 on maps1004 is OK: TCP OK - 0.001 second response time on 10.64.48.154 port 9042 https://phabricator.wikimedia.org/T93886 [08:18:12] (03PS4) 10Vgutierrez: varnish: Handle UDS traffic properly [puppet] - 10https://gerrit.wikimedia.org/r/713482 [08:18:13] RECOVERY - BGP status on cr3-ulsfo is OK: BGP OK - up: 66, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:18:26] 10SRE, 10Wikidata, 10Wikidata Query Builder, 10wdwb-tech, and 4 others: Deploy query builder to microsites (on top of the wdqs-ui) - https://phabricator.wikimedia.org/T266703 (10Addshore) {meme, src="seal-of-approval"} [08:18:43] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/713504 (https://phabricator.wikimedia.org/T288815) (owner: 10RLazarus) [08:28:35] (03CR) 10H.krishna123: [WIP]: Add basic tox config (031 comment) [software/bernard] - 10https://gerrit.wikimedia.org/r/713604 (https://phabricator.wikimedia.org/T284404) (owner: 10H.krishna123) [08:29:16] (03PS1) 10Jelto: hieradata::hosts::mw1 cleanup old canary api server hieradata [puppet] - 10https://gerrit.wikimedia.org/r/713607 [08:30:12] (03PS2) 10Jelto: hieradata::hosts::mw1 cleanup old canary api server hieradata [puppet] - 10https://gerrit.wikimedia.org/r/713607 (https://phabricator.wikimedia.org/T280203) [08:30:24] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10fgiunchedi) [08:30:40] 10ops-codfw, 10Maps: maps2005 power suply failure since a week - https://phabricator.wikimedia.org/T289113 (10ayounsi) p:05Triage→03High [08:34:03] PROBLEM - Router interfaces on mr1-esams is 
CRITICAL: CRITICAL: No response from remote host 91.198.174.247 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:34:08] (03PS1) 10Filippo Giunchedi: swift: add support for loopback storage device [puppet] - 10https://gerrit.wikimedia.org/r/713608 (https://phabricator.wikimedia.org/T288937) [08:34:12] (03PS1) 10Filippo Giunchedi: swift: stop carrying drive-audit patch starting with Bullseye [puppet] - 10https://gerrit.wikimedia.org/r/713609 (https://phabricator.wikimedia.org/T288937) [08:34:14] (03PS1) 10Filippo Giunchedi: swift: disable ecdhe curve in tlsproxy [puppet] - 10https://gerrit.wikimedia.org/r/713610 (https://phabricator.wikimedia.org/T279637) [08:34:34] (03PS3) 10Kormat: utils: Add support for Hosts: comments to pcc.py [puppet] - 10https://gerrit.wikimedia.org/r/710932 [08:34:48] XioNoX: ^^is that expected? [08:35:13] I'd bet on a SNMP fluke [08:35:17] let it re-check [08:35:30] ack [08:35:39] (03CR) 10Filippo Giunchedi: "I'll do the pool/depool dance of frontend hosts at deploy time since the patch restarts nginx" [puppet] - 10https://gerrit.wikimedia.org/r/713610 (https://phabricator.wikimedia.org/T279637) (owner: 10Filippo Giunchedi) [08:36:08] (03CR) 10MVernon: [C: 04-1] "Hi," [puppet] - 10https://gerrit.wikimedia.org/r/713230 (https://phabricator.wikimedia.org/T288937) (owner: 10Filippo Giunchedi) [08:36:16] (03CR) 10Kormat: "[+jbond]" [puppet] - 10https://gerrit.wikimedia.org/r/710932 (owner: 10Kormat) [08:36:58] (03CR) 10Jelto: "mw1276-mw1279 had dedicated hieradata. 
I removed most of the configuration for the new canary api servers mw1447-mw1450 except for two set" [puppet] - 10https://gerrit.wikimedia.org/r/713607 (https://phabricator.wikimedia.org/T280203) (owner: 10Jelto) [08:38:36] ACKNOWLEDGEMENT - Disk space on maps1004 is CRITICAL: DISK CRITICAL - free space: /srv 582 MB (0% inode=99%): ayounsi https://phabricator.wikimedia.org/T289123 https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=maps1004&var-datasource=eqiad+prometheus/ops [08:39:01] ACKNOWLEDGEMENT - IPMI Sensor Status on maps2005 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] ayounsi https://phabricator.wikimedia.org/T289113 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [08:39:51] RECOVERY - Router interfaces on mr1-esams is OK: OK: host 91.198.174.247, interfaces up: 41, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:40:27] XioNoX: you were right, as always <3 [08:45:31] PROBLEM - cassandra CQL 10.64.48.154:9042 on maps1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886 [08:45:43] (03PS2) 10Filippo Giunchedi: swift: ship uwsgi config for account/container server [puppet] - 10https://gerrit.wikimedia.org/r/713230 (https://phabricator.wikimedia.org/T288937) [08:45:45] (03CR) 10Filippo Giunchedi: "Thank you for the quick review!" 
[puppet] - 10https://gerrit.wikimedia.org/r/713230 (https://phabricator.wikimedia.org/T288937) (owner: 10Filippo Giunchedi) [08:47:42] 10SRE, 10Infrastructure-Foundations, 10netops, 10serviceops, 10Kubernetes: kubernetes1005 BGP down for 3 weeks - https://phabricator.wikimedia.org/T289111 (10JMeybohm) a:03JMeybohm [08:47:44] (03PS1) 10Ayounsi: BGP Icinga check, critical for k8s clusters [puppet] - 10https://gerrit.wikimedia.org/r/713611 (https://phabricator.wikimedia.org/T289111) [08:57:22] !log joal@deploy1002 Started deploy [analytics/refinery@88c6618]: Regular analytics weekly train [analytics/refinery@88c6618] [08:57:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:59:03] RECOVERY - cassandra CQL 10.64.48.154:9042 on maps1004 is OK: TCP OK - 0.001 second response time on 10.64.48.154 port 9042 https://phabricator.wikimedia.org/T93886 [09:00:18] (03CR) 10MVernon: [C: 03+1] swift: ship uwsgi config for account/container server (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/713230 (https://phabricator.wikimedia.org/T288937) (owner: 10Filippo Giunchedi) [09:08:59] PROBLEM - cassandra CQL 10.64.48.154:9042 on maps1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886 [09:13:35] RECOVERY - SSH on bast5001.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:16:39] RECOVERY - cassandra CQL 10.64.48.154:9042 on maps1004 is OK: TCP OK - 0.000 second response time on 10.64.48.154 port 9042 https://phabricator.wikimedia.org/T93886 [09:16:52] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org [09:19:14] 10SRE, 10Infrastructure-Foundations, 10netops, 10serviceops, and 2 others: kubernetes1005 BGP down for 3 weeks - https://phabricator.wikimedia.org/T289111 (10JMeybohm) This happened while I was running docker pull tests 2021-07-21 ~15:04Z and kubernetes1005 is 
one of the dedicated sessionstore nodes runnin... [09:22:07] PROBLEM - etcd request latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [09:22:33] PROBLEM - etcd request latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 operation={create,get,list,listWithCount,update} https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [09:23:35] PROBLEM - k8s API server requests latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 verb=POST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [09:24:05] RECOVERY - etcd request latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [09:25:33] RECOVERY - k8s API server requests latencies on kubestagemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [09:26:29] RECOVERY - etcd request latencies on kubestagemaster1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [09:26:52] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org [09:29:51] !log joal@deploy1002 Finished deploy [analytics/refinery@88c6618]: Regular analytics weekly train [analytics/refinery@88c6618] (duration: 32m 29s) [09:30:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:08] !log joal@deploy1002 Started deploy [analytics/refinery@88c6618] (thin): Regular analytics weekly train THIN [analytics/refinery@88c6618] [09:30:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:16] !log joal@deploy1002 Finished deploy [analytics/refinery@88c6618] (thin): Regular analytics weekly train THIN [analytics/refinery@88c6618] (duration: 00m 07s) [09:30:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:25] !log joal@deploy1002 Started deploy [analytics/refinery@88c6618] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@88c6618] [09:30:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:31] (03CR) 10MVernon: [C: 03+1] "LGTM :)" [puppet] - 10https://gerrit.wikimedia.org/r/713610 (https://phabricator.wikimedia.org/T279637) (owner: 10Filippo Giunchedi) [09:36:13] !log joal@deploy1002 Finished deploy [analytics/refinery@88c6618] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@88c6618] (duration: 05m 48s) [09:36:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:44:31] (03PS1) 10MVernon: Debian: Add support for bookworm as a valid codename [puppet] - 10https://gerrit.wikimedia.org/r/713615 [09:45:21] (03CR) 10Ladsgroup: 08-start-maintenance: Remove cron-specific maintenance implementation details (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/713532 (https://phabricator.wikimedia.org/T289078) 
(owner: 10RLazarus) [09:46:53] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org [09:50:38] 10SRE, 10ops-codfw: mw2383 is misbehaving - https://phabricator.wikimedia.org/T286463 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` ['mw2383.codfw.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/202108180950_jiji_30353.log`. [09:52:51] (03CR) 10MVernon: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/713609 (https://phabricator.wikimedia.org/T288937) (owner: 10Filippo Giunchedi) [09:54:26] (03PS1) 10JMeybohm: sre/kubernetes: Add alerting for nodes not running calico [alerts] - 10https://gerrit.wikimedia.org/r/713616 (https://phabricator.wikimedia.org/T289111) [09:55:58] 10SRE, 10Wikidata, 10Wikidata Query Builder, 10wdwb-tech, and 4 others: Deploy query builder to microsites (on top of the wdqs-ui) - https://phabricator.wikimedia.org/T266703 (10Ladsgroup) Only https://gerrit.wikimedia.org/r/c/wikidata/query-builder/+/713306 is waiting to be merged and then it can go to te... [09:56:53] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org [10:01:31] (03CR) 10MVernon: [C: 03+1] "This one was a bit more involved! LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/713608 (https://phabricator.wikimedia.org/T288937) (owner: 10Filippo Giunchedi) [10:04:13] (03CR) 10JMeybohm: [C: 03+1] "Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/713611 (https://phabricator.wikimedia.org/T289111) (owner: 10Ayounsi) [10:10:04] (03CR) 10Ayounsi: [C: 03+2] BGP Icinga check, critical for k8s clusters [puppet] - 10https://gerrit.wikimedia.org/r/713611 (https://phabricator.wikimedia.org/T289111) (owner: 10Ayounsi) [10:10:54] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2383.codfw.wmnet with reason: REIMAGE [10:11:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:21] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2383.codfw.wmnet with reason: REIMAGE [10:13:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:16:47] PROBLEM - cassandra CQL 10.64.48.154:9042 on maps1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886 [10:18:37] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on maps1004.eqiad.wmnet with reason: Awaiting decommissioning [10:18:39] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on maps1004.eqiad.wmnet with reason: Awaiting decommissioning [10:18:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:48] (03CR) 10Filippo Giunchedi: [C: 03+2] swift: ship uwsgi config for account/container server [puppet] - 10https://gerrit.wikimedia.org/r/713230 (https://phabricator.wikimedia.org/T288937) (owner: 10Filippo Giunchedi) [10:18:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:55] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM! 
Nice" [alerts] - 10https://gerrit.wikimedia.org/r/713616 (https://phabricator.wikimedia.org/T289111) (owner: 10JMeybohm) [10:24:29] RECOVERY - cassandra CQL 10.64.48.154:9042 on maps1004 is OK: TCP OK - 7.254 second response time on 10.64.48.154 port 9042 https://phabricator.wikimedia.org/T93886 [10:30:56] (03PS1) 10ZPapierski: Switch log level for flink streaming updater to DEBUG [deployment-charts] - 10https://gerrit.wikimedia.org/r/713617 [10:34:25] (03CR) 10Effie Mouzeli: [C: 03+2] Switch log level for flink streaming updater to DEBUG [deployment-charts] - 10https://gerrit.wikimedia.org/r/713617 (owner: 10ZPapierski) [10:35:37] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:35:56] (03CR) 10David Caro: [C: 03+1] "lgtm, did not test it, just one question to verify I did not miss anything." [puppet] - 10https://gerrit.wikimedia.org/r/713495 (owner: 10Bstorm) [10:36:38] 10SRE, 10ops-codfw: mw2383 is misbehaving - https://phabricator.wikimedia.org/T286463 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2383.codfw.wmnet'] ` and were **ALL** successful. [10:37:03] (03Merged) 10jenkins-bot: Switch log level for flink streaming updater to DEBUG [deployment-charts] - 10https://gerrit.wikimedia.org/r/713617 (owner: 10ZPapierski) [10:40:01] (03CR) 10JMeybohm: [C: 03+2] sre/kubernetes: Add alerting for nodes not running calico [alerts] - 10https://gerrit.wikimedia.org/r/713616 (https://phabricator.wikimedia.org/T289111) (owner: 10JMeybohm) [10:41:47] !log zpapierski@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'rdf-streaming-updater' for release 'main' . 
[10:41:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:38] (03PS1) 10ZPapierski: Bump flink-session-cluster chart to enable new log level [deployment-charts] - 10https://gerrit.wikimedia.org/r/713618 [10:47:54] !log pooling mw2383 - T286463 [10:48:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:04] T286463: mw2383 is misbehaving - https://phabricator.wikimedia.org/T286463 [10:49:59] (03PS4) 10Hnowlan: restbase: Add a new check_disk for instance-data volume [puppet] - 10https://gerrit.wikimedia.org/r/711135 (https://phabricator.wikimedia.org/T191659) [10:50:07] (03CR) 10Effie Mouzeli: [C: 03+2] Bump flink-session-cluster chart to enable new log level [deployment-charts] - 10https://gerrit.wikimedia.org/r/713618 (owner: 10ZPapierski) [10:52:33] (03Merged) 10jenkins-bot: Bump flink-session-cluster chart to enable new log level [deployment-charts] - 10https://gerrit.wikimedia.org/r/713618 (owner: 10ZPapierski) [10:55:36] (03CR) 10Hnowlan: [C: 03+2] restbase: Add a new check_disk for instance-data volume [puppet] - 10https://gerrit.wikimedia.org/r/711135 (https://phabricator.wikimedia.org/T191659) (owner: 10Hnowlan) [10:59:52] PROBLEM - k8s API server requests latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [11:00:05] Amir1, Lucas_WMDE, awight, and Urbanecm: It is that lovely time of the day again! You are hereby commanded to deploy European mid-day backport window. Your patch may or may not be deployed at the sole discretion of the deployer. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210818T1100). [11:00:05] No GERRIT patches in the queue for this window AFAICS. 
[11:00:26] (KubernetesCalicoDown) firing: kubernetes1005.eqiad.wmnet:9091 is not running calico-node Pod - https://alerts.wikimedia.org [11:00:27] Amir1: should we try to backport termbox now? [11:00:35] I think I can figure out how to do it [11:00:44] (probably mainly needs one more level of `git submodule update`) [11:01:00] RECOVERY - k8s API server requests latencies on kubemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [11:01:00] (03PS1) 10Effie Mouzeli: ProductionServices: change rdb* servers in eqiad and codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713619 (https://phabricator.wikimedia.org/T280582) [11:03:18] sure [11:03:23] Lucas_WMDE: if you're comfortable do it [11:03:25] alright, then I’ll go ahead [11:03:38] hm, though I guess we wouldn’t be able to test it actually [11:03:40] !log zpapierski@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'rdf-streaming-updater' for release 'main' . [11:03:43] since wikidata isn’t on wmf.19 yet [11:03:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:08] eh, let’s just wait until tomorrow [11:04:12] not that urgent [11:04:49] (03PS1) 10Lucas Werkmeister (WMDE): Update termbox [extensions/Wikibase] (wmf/1.37.0-wmf.19) - 10https://gerrit.wikimedia.org/r/713513 (https://phabricator.wikimedia.org/T286775) [11:05:19] (03CR) 10Lucas Werkmeister (WMDE): "I’ll probably deploy this tomorrow (assuming the train doesn’t get blocked, so that we can test this on Wikidata)." 
[extensions/Wikibase] (wmf/1.37.0-wmf.19) - 10https://gerrit.wikimedia.org/r/713513 (https://phabricator.wikimedia.org/T286775) (owner: 10Lucas Werkmeister (WMDE)) [11:09:13] 10SRE, 10Infrastructure-Foundations, 10netops, 10serviceops, and 2 others: kubernetes1005 BGP down for 3 weeks - https://phabricator.wikimedia.org/T289111 (10JMeybohm) K8s event logs (https://logstash.wikimedia.org/goto/b16700661b703799af5ac188db2d3f5c) are pretty clear on that I created a lot of disk pres... [11:12:14] !log jiji@cumin1001 conftool action : set/pooled=inactive; selector: name=mw2383.codfw.wmnet [11:12:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:33] 10SRE, 10ops-codfw: mw2383 is misbehaving - https://phabricator.wikimedia.org/T286463 (10jijiki) @Papaul I reimaged the server and pooled it back. It was performing horribly, which is why I didn't keep it in pooled for more than half an hour. The behaviour was the same, it seems that it is unable to scale its... [11:22:30] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:23:30] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:26:42] (03PS1) 10ZPapierski: Revert "Switch log level for flink streaming updater to DEBUG" [deployment-charts] - 10https://gerrit.wikimedia.org/r/713622 [11:32:04] (03CR) 10Effie Mouzeli: [C: 03+2] Revert "Switch log level for flink streaming updater to DEBUG" [deployment-charts] - 10https://gerrit.wikimedia.org/r/713622 (owner: 10ZPapierski) [11:34:35] (03Merged) 10jenkins-bot: Revert "Switch log level for flink streaming updater to DEBUG" [deployment-charts] - 10https://gerrit.wikimedia.org/r/713622 (owner: 10ZPapierski) [11:35:59] !log zpapierski@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'rdf-streaming-updater' for release 'main' . [11:36:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:22] !log zpapierski@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'rdf-streaming-updater' for release 'main' . [11:36:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:10] !log zpapierski@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'rdf-streaming-updater' for release 'main' . [11:38:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:01] (03CR) 10Filippo Giunchedi: sre/kubernetes: Add alerting for nodes not running calico (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/713616 (https://phabricator.wikimedia.org/T289111) (owner: 10JMeybohm) [11:53:08] 10SRE, 10Wikidata, 10Wikidata Query Builder, 10wdwb-tech, and 4 others: Deploy query builder to microsites (on top of the wdqs-ui) - https://phabricator.wikimedia.org/T266703 (10Ladsgroup) [12:02:55] godog, I am running the same 1 thread slow backup process on eqiad [12:03:20] jynus: ack, thanks for the heads up! 
feel free to crank things up in eqiad if you so desire [12:03:28] for some reason, it took a bit to speed up [12:03:52] 10SRE, 10Discovery-Search: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10Reedy) [12:03:56] some part of the chain must be cold, but now it is reading and uploading at reasonable speeds [12:04:29] (03CR) 10Ladsgroup: [C: 03+1] "LGTM, it seems similar patches has been done already before, that gives me confidence." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713619 (https://phabricator.wikimedia.org/T280582) (owner: 10Effie Mouzeli) [12:04:37] I will leave it slow for now, as I am in full debug mode, and then I will compare both eqiad and codfw clusters [12:08:33] *nod* [12:16:40] (03PS1) 10Kormat: db2121: Promote to s7 primary [puppet] - 10https://gerrit.wikimedia.org/r/713625 (https://phabricator.wikimedia.org/T289129) [12:17:09] PROBLEM - Packet loss ratio for UDP on logstash1009 is CRITICAL: 0.2931 ge 0.1 https://wikitech.wikimedia.org/wiki/Logstash https://grafana.wikimedia.org/dashboard/db/logstash [12:17:23] (03CR) 10Kormat: [C: 04-2] "Don't merge before maintenance window." [puppet] - 10https://gerrit.wikimedia.org/r/713625 (https://phabricator.wikimedia.org/T289129) (owner: 10Kormat) [12:19:22] (03PS1) 10Kormat: wmnet: Update s7-master to db2121 [dns] - 10https://gerrit.wikimedia.org/r/713626 (https://phabricator.wikimedia.org/T289129) [12:20:18] (03CR) 10Kormat: [C: 04-2] "Don't merge before maintenance window." 
[dns] - 10https://gerrit.wikimedia.org/r/713626 (https://phabricator.wikimedia.org/T289129) (owner: 10Kormat) [12:21:55] PROBLEM - Host wdqs1013 is DOWN: PING CRITICAL - Packet loss = 100% [12:22:01] (03PS3) 10Vgutierrez: envoyproxy: Remove trailing whitespace [puppet] - 10https://gerrit.wikimedia.org/r/710494 (https://phabricator.wikimedia.org/T265880) [12:22:03] (03PS3) 10Vgutierrez: envoyproxy: Support V3 configuration API [puppet] - 10https://gerrit.wikimedia.org/r/710495 (https://phabricator.wikimedia.org/T265880) [12:22:05] (03PS4) 10Vgutierrez: envoyproxy: Add prefetched OCSP staple support [puppet] - 10https://gerrit.wikimedia.org/r/710496 (https://phabricator.wikimedia.org/T271421) [12:22:07] (03PS6) 10Vgutierrez: envoyproxy: Add dual stack cert support [puppet] - 10https://gerrit.wikimedia.org/r/710507 (https://phabricator.wikimedia.org/T271421) [12:22:09] (03PS5) 10Vgutierrez: envoyproxy: Support ciphersuite configuration [puppet] - 10https://gerrit.wikimedia.org/r/710577 (https://phabricator.wikimedia.org/T271421) [12:22:11] (03PS4) 10Vgutierrez: envoyproxy: Support ECDH curves configuration [puppet] - 10https://gerrit.wikimedia.org/r/710581 (https://phabricator.wikimedia.org/T271421) [12:22:13] (03PS4) 10Vgutierrez: envoyproxy: Add upstream PROXY protocol support [puppet] - 10https://gerrit.wikimedia.org/r/711386 (https://phabricator.wikimedia.org/T271421) [12:22:15] (03PS4) 10Vgutierrez: envoyproxy: Add STEK configuration support [puppet] - 10https://gerrit.wikimedia.org/r/711399 (https://phabricator.wikimedia.org/T271421) [12:22:17] (03PS4) 10Vgutierrez: cache: Provide an envoy STEK manager script [puppet] - 10https://gerrit.wikimedia.org/r/711407 (https://phabricator.wikimedia.org/T271421) [12:22:19] (03PS5) 10Vgutierrez: envoyproxy: Provide support for UDS upstreams [puppet] - 10https://gerrit.wikimedia.org/r/712368 (https://phabricator.wikimedia.org/T271421) [12:22:19] RECOVERY - Host wdqs1013 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms 
[12:22:21] (03PS4) 10Vgutierrez: envoyproxy: Support alpn_protocols configuration [puppet] - 10https://gerrit.wikimedia.org/r/713238 (https://phabricator.wikimedia.org/T271421) [12:22:23] (03PS4) 10Vgutierrez: envoyproxy: Suport TLS min/max version config [puppet] - 10https://gerrit.wikimedia.org/r/713246 (https://phabricator.wikimedia.org/T271421) [12:22:25] (03PS3) 10Vgutierrez: envoyproxy: Allow setting a global lua script [puppet] - 10https://gerrit.wikimedia.org/r/713271 (https://phabricator.wikimedia.org/T271421) [12:22:27] (03PS3) 10Vgutierrez: cache: Use envoy lua API to provide TLS info [puppet] - 10https://gerrit.wikimedia.org/r/713272 (https://phabricator.wikimedia.org/T271421) [12:22:29] (03PS3) 10Vgutierrez: envoyproxy: Support PreserveCase HeaderKeyFormat [puppet] - 10https://gerrit.wikimedia.org/r/713460 (https://phabricator.wikimedia.org/T271421) [12:28:48] (03CR) 10Vgutierrez: varnish: Containerize varnish test environment (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/713445 (https://phabricator.wikimedia.org/T286639) (owner: 10MMandere) [12:33:29] 10SRE, 10Discovery-Search: Migrate Elasticsearch to Debian Buster - https://phabricator.wikimedia.org/T244736 (10Gehel) [12:35:45] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [12:37:29] 10SRE, 10Discovery-Search: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10Gehel) [12:39:10] (03CR) 10Marostegui: [C: 03+1] db2121: Promote to s7 primary [puppet] - 10https://gerrit.wikimedia.org/r/713625 (https://phabricator.wikimedia.org/T289129) (owner: 10Kormat) [12:39:36] 10SRE, 10Discovery-Search: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10Gehel) 
[12:39:40] (03CR) 10Marostegui: [C: 03+1] wmnet: Update s7-master to db2121 [dns] - 10https://gerrit.wikimedia.org/r/713626 (https://phabricator.wikimedia.org/T289129) (owner: 10Kormat) [12:51:01] (03CR) 10MMandere: varnish: Containerize varnish test environment (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/713445 (https://phabricator.wikimedia.org/T286639) (owner: 10MMandere) [12:52:29] PROBLEM - Packet loss ratio for UDP on logstash1008 is CRITICAL: 0.2941 ge 0.1 https://wikitech.wikimedia.org/wiki/Logstash https://grafana.wikimedia.org/dashboard/db/logstash [12:55:17] PROBLEM - Packet loss ratio for UDP on logstash1009 is CRITICAL: 0.2139 ge 0.1 https://wikitech.wikimedia.org/wiki/Logstash https://grafana.wikimedia.org/dashboard/db/logstash [13:01:11] !log uploaded wmfmariadbpy 0.7.2 to apt.wm.o [13:01:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:01:22] !log Deploying wmfmariadbpy 0.7.2 T289139 [13:01:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:01:30] T289139: Deploy wmfmariadbpy 0.7.2 - https://phabricator.wikimedia.org/T289139 [13:02:16] (03PS1) 10JMeybohm: sre/kubernetes: Add runbook link for KubernetesCalicoDown [alerts] - 10https://gerrit.wikimedia.org/r/713634 (https://phabricator.wikimedia.org/T289111) [13:03:28] (03PS1) 10Jelto: profile::gitlab load rsync::server only on passive GitLab [puppet] - 10https://gerrit.wikimedia.org/r/713635 (https://phabricator.wikimedia.org/T285867) [13:06:20] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (NOOP 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30643/console" [puppet] - 10https://gerrit.wikimedia.org/r/713635 (https://phabricator.wikimedia.org/T285867) (owner: 10Jelto) [13:08:44] (03CR) 10Jelto: [V: 03+1] "This should fix the change on every puppet run" [puppet] - 10https://gerrit.wikimedia.org/r/713635 (https://phabricator.wikimedia.org/T285867) (owner: 10Jelto) [13:19:28] (03PS2) 10Filippo 
Giunchedi: swift: add support for loopback storage device [puppet] - 10https://gerrit.wikimedia.org/r/713608 (https://phabricator.wikimedia.org/T288937) [13:19:41] (03CR) 10Btullis: "I think that this is a minimum required configuration of Alluxio in order to proceed with further testing." [puppet] - 10https://gerrit.wikimedia.org/r/712974 (https://phabricator.wikimedia.org/T266641) (owner: 10Btullis) [13:23:18] (03CR) 10Effie Mouzeli: [C: 03+1] tegola-vector-tiles: Fix the location for swift s3api [deployment-charts] - 10https://gerrit.wikimedia.org/r/713493 (https://phabricator.wikimedia.org/T289076) (owner: 10Jgiannelos) [13:24:52] !log mw2383 is depooled - T286463 [13:24:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:01] T286463: mw2383 is misbehaving - https://phabricator.wikimedia.org/T286463 [13:27:42] (03CR) 10Jgiannelos: [C: 03+2] tegola-vector-tiles: Fix the location for swift s3api [deployment-charts] - 10https://gerrit.wikimedia.org/r/713493 (https://phabricator.wikimedia.org/T289076) (owner: 10Jgiannelos) [13:27:50] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install cloudcephosd102[1-5].eqiad.wmnet - https://phabricator.wikimedia.org/T284471 (10dcaro) There's a new switch on D5, we should be able to start racking these :) [13:28:05] RECOVERY - BGP status on cr1-eqiad is OK: BGP OK - up: 79, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:28:47] (03CR) 10Btullis: "Tagging @moritzm, principally to double-check that my reserving of uid/gid values for a new hadoop system daemon user in admin/data.yaml c" [puppet] - 10https://gerrit.wikimedia.org/r/712974 (https://phabricator.wikimedia.org/T266641) (owner: 10Btullis) [13:30:26] (KubernetesCalicoDown) resolved: kubernetes1005.eqiad.wmnet:9091 is not running calico-node Pod - https://alerts.wikimedia.org [13:30:28] (03Merged) 10jenkins-bot: tegola-vector-tiles: Fix the location for swift s3api [deployment-charts] 
- 10https://gerrit.wikimedia.org/r/713493 (https://phabricator.wikimedia.org/T289076) (owner: 10Jgiannelos) [13:32:42] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (NOOP 2 DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30656/console" [puppet] - 10https://gerrit.wikimedia.org/r/712974 (https://phabricator.wikimedia.org/T266641) (owner: 10Btullis) [13:33:38] !log jgiannelos@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'tegola-vector-tiles' for release 'main' . [13:33:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:50] 10SRE, 10Infrastructure-Foundations, 10netops, 10serviceops, and 2 others: kubernetes1005 BGP down for 3 weeks - https://phabricator.wikimedia.org/T289111 (10JMeybohm) 05Open→03Resolved Ok, really dumb situation! A bunch of (failing) sessionstore Pods are clogging all resources on kubernetes1005, leavi... [13:34:10] (03CR) 10Kormat: [C: 03+1] site: Install memcached on new memcached servers in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/713494 (https://phabricator.wikimedia.org/T278225) (owner: 10Effie Mouzeli) [13:34:44] (03CR) 10Effie Mouzeli: [C: 03+2] site: Install memcached on new memcached servers in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/713494 (https://phabricator.wikimedia.org/T278225) (owner: 10Effie Mouzeli) [13:35:15] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30658/console" [puppet] - 10https://gerrit.wikimedia.org/r/713608 (https://phabricator.wikimedia.org/T288937) (owner: 10Filippo Giunchedi) [13:35:40] (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] swift: add support for loopback storage device [puppet] - 10https://gerrit.wikimedia.org/r/713608 (https://phabricator.wikimedia.org/T288937) (owner: 10Filippo Giunchedi) [13:36:01] (03CR) 10Filippo Giunchedi: [C: 03+2] swift: stop carrying drive-audit patch starting with Bullseye [puppet] - 
10https://gerrit.wikimedia.org/r/713609 (https://phabricator.wikimedia.org/T288937) (owner: 10Filippo Giunchedi) [13:36:08] (03PS2) 10Filippo Giunchedi: swift: stop carrying drive-audit patch starting with Bullseye [puppet] - 10https://gerrit.wikimedia.org/r/713609 (https://phabricator.wikimedia.org/T288937) [13:37:02] (03CR) 10Effie Mouzeli: "NOOP https://puppet-compiler.wmflabs.org/compiler1003/30660/" [puppet] - 10https://gerrit.wikimedia.org/r/713492 (owner: 10Effie Mouzeli) [13:37:30] (03CR) 10Effie Mouzeli: [C: 03+2] "NOOP https://puppet-compiler.wmflabs.org/compiler1001/30661/" [puppet] - 10https://gerrit.wikimedia.org/r/713492 (owner: 10Effie Mouzeli) [13:38:30] 10SRE-swift-storage, 10Maps, 10serviceops, 10Patch-For-Review: Tegola staging doesn't connect to swift - https://phabricator.wikimedia.org/T289076 (10Jgiannelos) Things look better on staging now after the last deployment. [13:39:16] 10SRE-swift-storage, 10Maps, 10serviceops, 10Patch-For-Review: Tegola staging doesn't connect to swift - https://phabricator.wikimedia.org/T289076 (10Jgiannelos) 05Open→03Resolved a:03Jgiannelos [13:39:17] (03PS7) 10Jgiannelos: tegola-vector-tiles: Connect staging to test-eqiad kafka [deployment-charts] - 10https://gerrit.wikimedia.org/r/713266 (https://phabricator.wikimedia.org/T283159) [13:41:09] !log bounce logstash on logstash100[89] [13:41:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:37] (03PS2) 10Effie Mouzeli: ProductionServices: change rdb* servers in eqiad and codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713619 (https://phabricator.wikimedia.org/T280582) [13:45:53] RECOVERY - Packet loss ratio for UDP on logstash1008 is OK: (C)0.1 ge (W)0.05 ge 0 https://wikitech.wikimedia.org/wiki/Logstash https://grafana.wikimedia.org/dashboard/db/logstash [13:47:07] (03CR) 10Effie Mouzeli: [C: 03+2] ProductionServices: change rdb* servers in eqiad and codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713619 
(https://phabricator.wikimedia.org/T280582) (owner: 10Effie Mouzeli) [13:48:13] (03CR) 10Jgiannelos: [C: 03+2] tegola-vector-tiles: Connect staging to test-eqiad kafka [deployment-charts] - 10https://gerrit.wikimedia.org/r/713266 (https://phabricator.wikimedia.org/T283159) (owner: 10Jgiannelos) [13:50:31] RECOVERY - Packet loss ratio for UDP on logstash1009 is OK: (C)0.1 ge (W)0.05 ge 0 https://wikitech.wikimedia.org/wiki/Logstash https://grafana.wikimedia.org/dashboard/db/logstash [13:51:06] (03Merged) 10jenkins-bot: tegola-vector-tiles: Connect staging to test-eqiad kafka [deployment-charts] - 10https://gerrit.wikimedia.org/r/713266 (https://phabricator.wikimedia.org/T283159) (owner: 10Jgiannelos) [13:51:20] (03PS3) 10Effie Mouzeli: ProductionServices: change rdb* servers in eqiad and codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713619 (https://phabricator.wikimedia.org/T280582) [13:51:53] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [13:57:03] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [13:57:06] !log jiji@deploy1002 Synchronized wmf-config/ProductionServices.php: Config: [[gerrit:713619|ProductionServices: change rdb* servers in eqiad and codfw (T280582)]] (duration: 01m 51s) [13:57:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:18] T280582: Shrink redis_sessions cluster - https://phabricator.wikimedia.org/T280582 [14:01:00] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[14:01:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:02] (03CR) 10JMeybohm: sre/kubernetes: Add alerting for nodes not running calico (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/713616 (https://phabricator.wikimedia.org/T289111) (owner: 10JMeybohm) [14:02:13] 10SRE, 10ops-codfw: mw2383 is misbehaving - https://phabricator.wikimedia.org/T286463 (10Papaul) @jijiki thank you i will open a case with Dell [14:03:30] (03CR) 10Dzahn: [C: 03+1] "lgtm but best would be to also get +1 from Effie" [puppet] - 10https://gerrit.wikimedia.org/r/713607 (https://phabricator.wikimedia.org/T280203) (owner: 10Jelto) [14:04:39] (03CR) 10JMeybohm: [C: 03+1] site: Install memcached on new memcached servers in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/713494 (https://phabricator.wikimedia.org/T278225) (owner: 10Effie Mouzeli) [14:04:49] (03CR) 10Effie Mouzeli: [C: 03+1] hieradata::hosts::mw1 cleanup old canary api server hieradata [puppet] - 10https://gerrit.wikimedia.org/r/713607 (https://phabricator.wikimedia.org/T280203) (owner: 10Jelto) [14:05:58] (03CR) 10Effie Mouzeli: [C: 03+1] "one nit, consider using only 'hieradata: cleanup blabla' in the title" [puppet] - 10https://gerrit.wikimedia.org/r/713607 (https://phabricator.wikimedia.org/T280203) (owner: 10Jelto) [14:06:56] (03PS2) 10Effie Mouzeli: site: Install memcached on new memcached servers in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/713494 (https://phabricator.wikimedia.org/T278225) [14:07:58] (03CR) 10Btullis: "I've tested with PCC that it's a noop for non-hadoop servers (cumin1001)" [puppet] - 10https://gerrit.wikimedia.org/r/712974 (https://phabricator.wikimedia.org/T266641) (owner: 10Btullis) [14:09:05] (03PS3) 10Jelto: hieradata: cleanup old canary api server [puppet] - 10https://gerrit.wikimedia.org/r/713607 (https://phabricator.wikimedia.org/T280203) [14:11:15] (03CR) 10Jelto: [C: 03+2] hieradata: cleanup old canary api server [puppet] - 
10https://gerrit.wikimedia.org/r/713607 (https://phabricator.wikimedia.org/T280203) (owner: 10Jelto) [14:11:46] !log disable puppet on alerts* to avoid alert flood due to 713494 [14:11:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:22] (03PS3) 10Dzahn: miscweb: add helmfile.yaml and values under services.d [deployment-charts] - 10https://gerrit.wikimedia.org/r/713441 (https://phabricator.wikimedia.org/T281538) [14:15:55] (03CR) 10Dzahn: "ACK, removing empty values- files. I just got these because I did a "cp -r" of the _example_ dir as the README told me. Let me add a comme" [deployment-charts] - 10https://gerrit.wikimedia.org/r/713441 (https://phabricator.wikimedia.org/T281538) (owner: 10Dzahn) [14:19:07] RECOVERY - Long running screen/tmux on maps1009 is OK: OK: No SCREEN or tmux processes detected. https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens [14:22:09] (03PS1) 10Dzahn: fix whitespace and add comments to delete empty files in _example_ dir [deployment-charts] - 10https://gerrit.wikimedia.org/r/713639 [14:23:29] 10SRE, 10Alerting, 10Infrastructure-Foundations, 10netops, 10SRE Observability (FY2021/2022-Q1): Ingest Cron and Root Alerts Into Logstash - https://phabricator.wikimedia.org/T274377 (10lmata) [14:23:53] (03CR) 10Dzahn: "https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/713639" [deployment-charts] - 10https://gerrit.wikimedia.org/r/713441 (https://phabricator.wikimedia.org/T281538) (owner: 10Dzahn) [14:26:06] !log enable puppet on alert* [14:26:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:54] 10SRE, 10Alerting, 10Infrastructure-Foundations, 10netops: Ingest Cron and Root Alerts Into Logstash - https://phabricator.wikimedia.org/T274377 (10lmata) [14:28:06] 10SRE, 10Alerting: Two close pages for idle workers api + appserver didn't auto-resolve on recovery - https://phabricator.wikimedia.org/T266570 (10lmata) [14:29:52] (03PS1) 10Jgiannelos: 
tegola-vector-tiles: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/713641 [14:30:15] 10SRE, 10SRE-Access-Requests: Requesting access to releasers-wikibase for @dang - https://phabricator.wikimedia.org/T288355 (10RobH) a:05toan→03RobH [14:30:19] 10SRE, 10Infrastructure-Foundations, 10Metrics, 10CAS-SSO, 10User-jbond: thanos u/i gives errors if left idle for a few hours - https://phabricator.wikimedia.org/T268233 (10lmata) [14:32:48] 10SRE, 10Logging, 10Security, 10User-jbond: ulog: filter out diffscan from ulog - https://phabricator.wikimedia.org/T265590 (10lmata) [14:34:17] 10SRE, 10SRE-OnFire, 10SRE Observability (FY2021/2022-Q1): Ensure SRE team has a good understanding of how & when to declare an outage on the status page; & it is easy to do so - https://phabricator.wikimedia.org/T285769 (10lmata) a:03lmata [14:35:04] (03CR) 10Jgiannelos: "Now that all nodes are going to run imposm maybe it makes sense to cleanup the default value for the lookup on `class profile::maps::apps`" [puppet] - 10https://gerrit.wikimedia.org/r/713451 (https://phabricator.wikimedia.org/T288810) (owner: 10Hnowlan) [14:42:15] 10SRE, 10ops-eqiad, 10decommission-hardware, 10serviceops, 10Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (10Jelto) 05Open→03Resolved [14:42:25] 10SRE, 10ops-eqiad, 10decommission-hardware, 10serviceops, 10Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (10Jelto) From my side everything is done. Thanks everyone. I'm going to close this ticket. Feel free to... 
[14:42:59] 10SRE, 10ops-eqiad, 10Thumbor, 10serviceops, 10User-jijiki: (OoW) thumbor1004 memory errors - https://phabricator.wikimedia.org/T215411 (10Jelto) [14:44:14] (03CR) 10MSantos: [C: 03+2] tegola-vector-tiles: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/713641 (owner: 10Jgiannelos) [14:45:55] 10SRE, 10ops-codfw: mw2383 is misbehaving - https://phabricator.wikimedia.org/T286463 (10Papaul) Looking at the log today i have ` 2021-08-18 10:05:19 PWR2400 Power management firmware unable to maintain power limit. Log Sequence Number: 6709 Detailed Description: The power management firmware cannot re... [14:46:45] (03Merged) 10jenkins-bot: tegola-vector-tiles: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/713641 (owner: 10Jgiannelos) [14:47:16] (03CR) 10Jelto: [C: 03+1] "lgtm" [deployment-charts] - 10https://gerrit.wikimedia.org/r/713441 (https://phabricator.wikimedia.org/T281538) (owner: 10Dzahn) [14:49:40] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): (Need By: 2020-06-20) rack/setup/install cloudvirt10[31-39]eqiad.wmnet - https://phabricator.wikimedia.org/T251627 (10Andrew) [14:49:51] 10SRE, 10ops-eqiad, 10Thumbor, 10serviceops, 10User-jijiki: (OoW) thumbor1004 memory errors - https://phabricator.wikimedia.org/T215411 (10Jelto) [14:50:17] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): cloudvirt1038: PCIe error - https://phabricator.wikimedia.org/T276922 (10Andrew) 05Open→03Resolved This host is now pooled and presumed fixed. Thanks all! [14:50:29] (03CR) 10Filippo Giunchedi: [C: 03+2] sre/kubernetes: Add runbook link for KubernetesCalicoDown [alerts] - 10https://gerrit.wikimedia.org/r/713634 (https://phabricator.wikimedia.org/T289111) (owner: 10JMeybohm) [14:50:45] !log jgiannelos@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'tegola-vector-tiles' for release 'main' . 
[14:50:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:19] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Orchestrator, and 2 others: Puppet host certs do not contain Subject Alt Name entries - https://phabricator.wikimedia.org/T273637 (10Kormat) Golang 1.17 will remove support for the work-around: https://golang.org/doc/go1.16#crypto/x509 [14:55:53] (03CR) 10RLazarus: 08-start-maintenance: Remove cron-specific maintenance implementation details (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/713532 (https://phabricator.wikimedia.org/T289078) (owner: 10RLazarus) [14:55:57] (03CR) 10Ema: varnish: Handle UDS traffic properly (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/713482 (owner: 10Vgutierrez) [14:58:23] (03CR) 10Bstorm: wikireplicas: remove old code for supporting monolithic replicas (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/713495 (owner: 10Bstorm) [14:59:24] (03CR) 10MSantos: [C: 03+1] maps: move configuration overrides to main configuration [puppet] - 10https://gerrit.wikimedia.org/r/713451 (https://phabricator.wikimedia.org/T288810) (owner: 10Hnowlan) [15:05:30] (03CR) 10Ladsgroup: 08-start-maintenance: Remove cron-specific maintenance implementation details (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/713532 (https://phabricator.wikimedia.org/T289078) (owner: 10RLazarus) [15:05:48] (03PS5) 10Vgutierrez: varnish: Handle UDS traffic properly [puppet] - 10https://gerrit.wikimedia.org/r/713482 (https://phabricator.wikimedia.org/T285374) [15:06:56] (03CR) 10Vgutierrez: varnish: Handle UDS traffic properly (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/713482 (https://phabricator.wikimedia.org/T285374) (owner: 10Vgutierrez) [15:09:08] (03CR) 10JMeybohm: [C: 03+1] "Thanks!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/713639 (owner: 10Dzahn) [15:11:39] (03CR) 10Dzahn: [C: 03+2] fix whitespace and add comments to delete empty files in _example_ dir [deployment-charts] - 10https://gerrit.wikimedia.org/r/713639 (owner: 10Dzahn) [15:12:44] (03CR) 10JMeybohm: [C: 03+1] "You might want to limit staging to just one replica to not waste resources, but up to you I'd say." [deployment-charts] - 10https://gerrit.wikimedia.org/r/713441 (https://phabricator.wikimedia.org/T281538) (owner: 10Dzahn) [15:14:46] (03Merged) 10jenkins-bot: fix whitespace and add comments to delete empty files in _example_ dir [deployment-charts] - 10https://gerrit.wikimedia.org/r/713639 (owner: 10Dzahn) [15:15:19] (03CR) 10JMeybohm: [C: 03+1] "Sounds legit" [puppet] - 10https://gerrit.wikimedia.org/r/713635 (https://phabricator.wikimedia.org/T285867) (owner: 10Jelto) [15:16:53] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence-Backup: hw troubleshooting: system right cp board missing in new host backup1006 - https://phabricator.wikimedia.org/T286625 (10Cmjohnson) 05Open→03Resolved The error has not returned, if it appears again please re-open and ping me. [15:19:25] (03CR) 10Dzahn: "Yea, this would work on new hosts, but it would not stop/remove anything once it has been applied on a server in the past. You could also " [puppet] - 10https://gerrit.wikimedia.org/r/713635 (https://phabricator.wikimedia.org/T285867) (owner: 10Jelto) [15:19:59] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence-Backup: hw troubleshooting: system right cp board missing in new host backup1006 - https://phabricator.wikimedia.org/T286625 (10jcrespo) Thank you very much, Chris! 
[15:20:22] (03CR) 10Ema: [C: 03+1] varnish: Handle UDS traffic properly [puppet] - 10https://gerrit.wikimedia.org/r/713482 (https://phabricator.wikimedia.org/T285374) (owner: 10Vgutierrez) [15:22:17] (03CR) 10Ema: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30693/console" [puppet] - 10https://gerrit.wikimedia.org/r/713226 (owner: 10Vgutierrez) [15:23:16] (03CR) 10Dzahn: "kind of surprised it leads to a change on every run, expected rsyncd to run just without any module nothing can pull from it, not that it " [puppet] - 10https://gerrit.wikimedia.org/r/713635 (https://phabricator.wikimedia.org/T285867) (owner: 10Jelto) [15:23:24] (03PS2) 10Jdlrobson: Enable page previews on German Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713013 (https://phabricator.wikimedia.org/T264305) [15:24:47] (03PS3) 10Jdlrobson: Drop unused config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/711187 (https://phabricator.wikimedia.org/T288553) [15:25:52] (03CR) 10Jgiannelos: maps: move configuration overrides to main configuration (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/713451 (https://phabricator.wikimedia.org/T288810) (owner: 10Hnowlan) [15:26:38] (03CR) 10Ema: [V: 03+1 C: 03+1] varnish: Do not assume that UDS implies PROXY protocol (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/713226 (owner: 10Vgutierrez) [15:31:26] (03PS2) 10RLazarus: envoyproxy: Add $runtime field to set a static runtime layer. 
[puppet] - 10https://gerrit.wikimedia.org/r/713504 (https://phabricator.wikimedia.org/T288815) [15:31:29] (03CR) 10Jgiannelos: [C: 03+1] maps: move configuration overrides to main configuration [puppet] - 10https://gerrit.wikimedia.org/r/713451 (https://phabricator.wikimedia.org/T288810) (owner: 10Hnowlan) [15:32:44] (03CR) 10Hnowlan: [V: 03+1] maps: move configuration overrides to main configuration (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/713451 (https://phabricator.wikimedia.org/T288810) (owner: 10Hnowlan) [15:34:29] (03PS2) 10Ema: Add Varnish SLO dashboard [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/713440 (https://phabricator.wikimedia.org/T289036) [15:38:41] (03PS2) 10Vgutierrez: varnish: Do not assume that UDS implies PROXY protocol [puppet] - 10https://gerrit.wikimedia.org/r/713226 (https://phabricator.wikimedia.org/T285374) [15:41:21] (03CR) 10Ema: Add Varnish SLO dashboard (032 comments) [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/713440 (https://phabricator.wikimedia.org/T289036) (owner: 10Ema) [15:43:27] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [15:43:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:07] (03CR) 10Dzahn: miscweb: add helmfile.yaml and values under services.d (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/713441 (https://phabricator.wikimedia.org/T281538) (owner: 10Dzahn) [15:45:20] (03PS4) 10Dzahn: miscweb: add helmfile.yaml and values under services.d [deployment-charts] - 10https://gerrit.wikimedia.org/r/713441 (https://phabricator.wikimedia.org/T281538) [15:45:31] (03PS3) 10Cwhite: openstack: add more fields to nova_fullstack_test logging [puppet] - 10https://gerrit.wikimedia.org/r/713559 [15:46:24] (03PS2) 10Jelto: profile::gitlab load rsync::server only on passive GitLab [puppet] - 10https://gerrit.wikimedia.org/r/713635 (https://phabricator.wikimedia.org/T285867) [15:46:54] (03CR) 10jerkins-bot: [V: 04-1] profile::gitlab 
load rsync::server only on passive GitLab [puppet] - 10https://gerrit.wikimedia.org/r/713635 (https://phabricator.wikimedia.org/T285867) (owner: 10Jelto) [15:48:10] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:48:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:50] (03PS3) 10Jelto: profile::gitlab load rsync::server only on passive GitLab [puppet] - 10https://gerrit.wikimedia.org/r/713635 (https://phabricator.wikimedia.org/T285867) [15:49:02] (03PS3) 10RLazarus: envoyproxy: Add $runtime field to set a static runtime layer. [puppet] - 10https://gerrit.wikimedia.org/r/713504 (https://phabricator.wikimedia.org/T288815) [15:51:20] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30694/console" [puppet] - 10https://gerrit.wikimedia.org/r/713635 (https://phabricator.wikimedia.org/T285867) (owner: 10Jelto) [15:54:11] (03CR) 10Dzahn: "How about a selector?" [puppet] - 10https://gerrit.wikimedia.org/r/713635 (https://phabricator.wikimedia.org/T285867) (owner: 10Jelto) [15:59:43] (03CR) 10Dzahn: [C: 03+2] miscweb: add helmfile.yaml and values under services.d [deployment-charts] - 10https://gerrit.wikimedia.org/r/713441 (https://phabricator.wikimedia.org/T281538) (owner: 10Dzahn) [16:01:17] 10SRE, 10ops-eqiad, 10decommission-hardware, 10serviceops, 10Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (10Cmjohnson) 05Resolved→03Open a:05Jelto→03Cmjohnson re-opened and assigned to me to use this s... 
[16:02:12] (03Merged) 10jenkins-bot: miscweb: add helmfile.yaml and values under services.d [deployment-charts] - 10https://gerrit.wikimedia.org/r/713441 (https://phabricator.wikimedia.org/T281538) (owner: 10Dzahn) [16:02:17] (03CR) 10Andrew Bogott: [C: 03+2] wmcs.ceph: add cloudcephosd1018 as osd [puppet] - 10https://gerrit.wikimedia.org/r/711499 (https://phabricator.wikimedia.org/T285858) (owner: 10David Caro) [16:02:27] (03PS4) 10Andrew Bogott: wmcs.ceph: add cloudcephosd1018 as osd [puppet] - 10https://gerrit.wikimedia.org/r/711499 (https://phabricator.wikimedia.org/T285858) (owner: 10David Caro) [16:04:16] (03PS4) 10Jelto: profile::gitlab load rsync::server only on passive GitLab [puppet] - 10https://gerrit.wikimedia.org/r/713635 (https://phabricator.wikimedia.org/T285867) [16:05:01] (03CR) 10Dzahn: [C: 03+1] profile::gitlab load rsync::server only on passive GitLab [puppet] - 10https://gerrit.wikimedia.org/r/713635 (https://phabricator.wikimedia.org/T285867) (owner: 10Jelto) [16:06:47] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (NOOP 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30695/console" [puppet] - 10https://gerrit.wikimedia.org/r/713635 (https://phabricator.wikimedia.org/T285867) (owner: 10Jelto) [16:07:59] (03CR) 10Jelto: [V: 03+1 C: 03+2] profile::gitlab load rsync::server only on passive GitLab [puppet] - 10https://gerrit.wikimedia.org/r/713635 (https://phabricator.wikimedia.org/T285867) (owner: 10Jelto) [16:17:22] (03PS1) 10Jcrespo: mediabackups: Backup enwiki local originals [puppet] - 10https://gerrit.wikimedia.org/r/713651 (https://phabricator.wikimedia.org/T262668) [16:24:38] (03PS1) 10Jdlrobson: Enable NearbyPages on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713653 (https://phabricator.wikimedia.org/T246493) [16:24:51] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [16:24:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:32] !log 
cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:28:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:25] 10SRE, 10MW-on-K8s, 10serviceops: Make HTTP calls work within mediawiki on kubernetes - https://phabricator.wikimedia.org/T288848 (10Legoktm) >>! In T288848#7287882, @JMeybohm wrote: > I'd assume that MW makes HTTP calls to the public endpoints of MW. Those will be blocked in k8s as we generally prohibit egr... [16:40:23] 10SRE-tools, 10DBA, 10Infrastructure-Foundations, 10Spicerack, 10Datacenter-Switchover: switchdc should verify active/active DBs are read-write in both datacenters - https://phabricator.wikimedia.org/T287129 (10Legoktm) p:05Triage→03Low >>! In T287129#7229380, @LSobanski wrote: > Certainly makes sens... [16:40:44] (03PS33) 10Btullis: Install Alluxio to the test cluster [puppet] - 10https://gerrit.wikimedia.org/r/712974 (https://phabricator.wikimedia.org/T266641) [16:45:11] 10SRE, 10Graphite, 10Patch-For-Review, 10SRE Observability (FY2021/2022-Q1), 10User-fgiunchedi: Programmatic generation of grafana dashboards - https://phabricator.wikimedia.org/T171482 (10lmata) @herron this sounds like should be folded into the Grizzly work or closed at this point. If you have opinions... 
[16:46:06] (03CR) 10Jcrespo: [C: 03+2] mediabackups: Backup enwiki local originals [puppet] - 10https://gerrit.wikimedia.org/r/713651 (https://phabricator.wikimedia.org/T262668) (owner: 10Jcrespo) [16:49:19] 10SRE, 10serviceops, 10Performance-Team (Radar), 10User-jijiki: Shrink redis_sessions cluster - https://phabricator.wikimedia.org/T280582 (10jijiki) [16:51:46] (03PS1) 10Effie Mouzeli: hieradata: remove shard01 from redis_sessions [puppet] - 10https://gerrit.wikimedia.org/r/713655 (https://phabricator.wikimedia.org/T280582) [17:02:47] 10SRE, 10Graphite, 10Patch-For-Review, 10SRE Observability (FY2021/2022-Q1), 10User-fgiunchedi: Programmatic generation of grafana dashboards - https://phabricator.wikimedia.org/T171482 (10herron) Sounds good, yes grizzly deploys the jsonnet/grafonnet approach outlined in the task description and good pr... [17:04:20] (03PS1) 10Herron: remove default id, version fields [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/713658 [17:04:43] 10SRE, 10ops-eqiad, 10decommission-hardware, 10serviceops, 10Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (10Cmjohnson) 05Open→03Resolved all are decommissioned and removed from the rack [17:04:56] (03PS1) 10RobH: new shell user dang plus addition to releasers-wikibase [puppet] - 10https://gerrit.wikimedia.org/r/713659 (https://phabricator.wikimedia.org/T288355) [17:06:13] (03CR) 10RobH: [C: 03+2] new shell user dang plus addition to releasers-wikibase [puppet] - 10https://gerrit.wikimedia.org/r/713659 (https://phabricator.wikimedia.org/T288355) (owner: 10RobH) [17:06:39] (03CR) 10Herron: [V: 03+2 C: 03+2] remove default id, version fields [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/713658 (owner: 10Herron) [17:07:18] 10SRE, 10SRE-Access-Requests: Requesting access to releasers-wikibase for @dang - https://phabricator.wikimedia.org/T288355 (10RobH) [17:08:46] 10SRE, 10SRE-Access-Requests: 
Requesting access to releasers-wikibase for @dang - https://phabricator.wikimedia.org/T288355 (10RobH) 05Open→03Resolved a:05RobH→03None @dang, I've merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/713659 live, and it'll take about 30-60 minutes to propagate to... [17:09:42] (03PS1) 10Herron: Dashboard/slo-apigw: remove default version, id fields [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/713660 [17:10:16] (03CR) 10Herron: [V: 03+2 C: 03+2] Dashboard/slo-apigw: remove default version, id fields [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/713660 (owner: 10Herron) [17:13:11] (03PS1) 10Majavah: Replace distro with os release [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/713661 (https://phabricator.wikimedia.org/T278748) [17:14:58] 10SRE-tools, 10DBA, 10Infrastructure-Foundations, 10Spicerack, 10Datacenter-Switchover: switchdc should verify active/active DBs are read-write in both datacenters - https://phabricator.wikimedia.org/T287129 (10LSobanski) @Legoktm Thanks! I'd say this is not top of our priority list right now so the cook... [17:21:06] PROBLEM - SSH on bast5001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:24:32] (03PS2) 10Majavah: Replace distro with os release [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/713661 (https://phabricator.wikimedia.org/T278748) [17:27:19] (03CR) 10Legoktm: [C: 03+1] "Overall LGTM." 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/713530 (https://phabricator.wikimedia.org/T289078) (owner: 10RLazarus) [17:28:36] (03CR) 10Legoktm: [C: 03+1] "Pending spicerack patch" [cookbooks] - 10https://gerrit.wikimedia.org/r/713532 (https://phabricator.wikimedia.org/T289078) (owner: 10RLazarus) [17:29:23] (03PS1) 10Majavah: Take OS codename into account for grid compatibility [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/713663 (https://phabricator.wikimedia.org/T278748) [17:31:15] (03PS2) 10Majavah: Take OS codename into account for grid compatibility [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/713663 (https://phabricator.wikimedia.org/T278748) [17:38:18] (03PS1) 10Herron: add cache_type variable to template [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/713664 [17:39:18] (03CR) 10Herron: [V: 03+2 C: 03+2] add cache_type variable to template [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/713664 (owner: 10Herron) [17:41:46] (03PS7) 10RLazarus: mediawiki: Remove cron-specific maintenance implementation details [software/spicerack] - 10https://gerrit.wikimedia.org/r/713530 (https://phabricator.wikimedia.org/T289078) [17:42:31] (03CR) 10RLazarus: mediawiki: Remove cron-specific maintenance implementation details (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/713530 (https://phabricator.wikimedia.org/T289078) (owner: 10RLazarus) [17:49:37] (03PS1) 10Bstorm: paws trove: fix duplicate identifier mistake [puppet] - 10https://gerrit.wikimedia.org/r/713665 [17:51:12] (03CR) 10Legoktm: [C: 03+1] mediawiki: Remove cron-specific maintenance implementation details [software/spicerack] - 10https://gerrit.wikimedia.org/r/713530 (https://phabricator.wikimedia.org/T289078) (owner: 10RLazarus) [17:55:04] (03CR) 10Bstorm: [C: 03+2] paws trove: fix duplicate identifier mistake [puppet] - 10https://gerrit.wikimedia.org/r/713665 (owner: 10Bstorm) [17:56:23] (03PS3) 10Legoktm: sre.switchdc.mediawiki: Run the 
warmup cache script at least 6 times [cookbooks] - 10https://gerrit.wikimedia.org/r/707457 (https://phabricator.wikimedia.org/T285802) [18:00:04] brennen and jeena: (Dis)respected human, time to deploy Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210818T1800). Please do the needful. [18:00:04] RoanKattouw, Niharika, and Urbanecm: It is that lovely time of the day again! You are hereby commanded to deploy Morning backport window (Your patch may or may not be deployed at the sole discretion of the deployer). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210818T1800). [18:00:05] MatmaRex and Jdlrobson: A patch you scheduled for Morning backport window (Your patch may or may not be deployed at the sole discretion of the deployer) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:14] hi [18:00:14] I can deploy today [18:00:26] hi Guest8870! [18:01:20] Guest8870: just to make it a bit easier for me, could you change nick to the ordinary one please? 🙂 [18:01:22] oh [18:01:39] i'm having some nickserv issues [18:01:39] (03PS2) 10Urbanecm: Enable DiscussionTools' topicsubscription as beta feature on phase 1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713491 (https://phabricator.wikimedia.org/T287800) (owner: 10Bartosz Dziewoński) [18:01:42] thanks! 
[18:01:48] (03CR) 10Urbanecm: [C: 03+2] Enable DiscussionTools' topicsubscription as beta feature on phase 1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713491 (https://phabricator.wikimedia.org/T287800) (owner: 10Bartosz Dziewoński) [18:01:51] (03CR) 10RLazarus: [C: 03+1] "Seems like this almost guarantees we'll never actually depend on the timing logic, since it should always converge first -- we could just " [cookbooks] - 10https://gerrit.wikimedia.org/r/707457 (https://phabricator.wikimedia.org/T285802) (owner: 10Legoktm) [18:02:34] (03Merged) 10jenkins-bot: Enable DiscussionTools' topicsubscription as beta feature on phase 1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713491 (https://phabricator.wikimedia.org/T287800) (owner: 10Bartosz Dziewoński) [18:02:58] urbanecm: present [18:03:05] sorry for the delay [18:03:19] MatmaRex__: your patch is at mwdebug2001, can you test please? [18:03:34] hi Jdlrobson, no problem -- i'm just about to start with your code [18:03:42] looking [18:04:05] (03CR) 10Jcrespo: "recheck" [software/bernard] - 10https://gerrit.wikimedia.org/r/713604 (https://phabricator.wikimedia.org/T284404) (owner: 10H.krishna123) [18:04:24] urbanecm: looks good [18:04:28] thanks, syncing [18:05:22] (03PS3) 10Urbanecm: Enable page previews on German Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713013 (https://phabricator.wikimedia.org/T264305) (owner: 10Jdlrobson) [18:05:28] (03CR) 10Urbanecm: [C: 03+2] Enable page previews on German Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713013 (https://phabricator.wikimedia.org/T264305) (owner: 10Jdlrobson) [18:06:02] (03PS4) 10Urbanecm: Drop unused config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/711187 (https://phabricator.wikimedia.org/T288553) (owner: 10Jdlrobson) [18:06:07] (03CR) 10Urbanecm: [C: 03+2] Drop unused config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/711187 (https://phabricator.wikimedia.org/T288553) 
(owner: 10Jdlrobson) [18:06:13] (03Merged) 10jenkins-bot: Enable page previews on German Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713013 (https://phabricator.wikimedia.org/T264305) (owner: 10Jdlrobson) [18:06:41] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 35113b617b3540242ac69a8285c54c70041bc14b: Enable DiscussionTools topicsubscription as beta feature on phase 1 wikis (T287800) (duration: 01m 25s) [18:06:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:06:50] T287800: Deploy Config to Introduce Manual Topic Subscriptions as Beta Feature at Phase 1 Wikis - https://phabricator.wikimedia.org/T287800 [18:06:57] (03Merged) 10jenkins-bot: Drop unused config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/711187 (https://phabricator.wikimedia.org/T288553) (owner: 10Jdlrobson) [18:07:21] Jdlrobson: the dewikivoyage patch is at mwdebug2001, please have a look [18:07:51] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [18:07:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:18] (03CR) 10Legoktm: [V: 03+2 C: 03+2] Provide nodejs12-slim and -devel based on Bullseye (032 comments) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/697672 (https://phabricator.wikimedia.org/T284346) (owner: 10Jforrester) [18:09:21] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [18:09:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:35] urbanecm: almost done [18:09:38] ack [18:09:40] take your time [18:11:17] urbanecm: LGTM [18:11:20] thanks, syncing [18:12:26] Jdlrobson: for deploying the new extension, that's unfortunately not possible (yet). 
You first need to start branching the extension in wmf branches before it can be enabled on beta [18:12:46] see https://gerrit.wikimedia.org/r/c/mediawiki/tools/release/+/621242 as an example on how to do that [18:12:47] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 559dd701a5859223afd49aaa33ddab70e8ebe721: Enable page previews on German Wikivoyage (T264305) (duration: 01m 08s) [18:12:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:55] T264305: German Wikivoyage: make extensions Popups (Navigation popups) and Reference Previews available - https://phabricator.wikimedia.org/T264305 [18:13:00] then, two trains need to pass, and then we can get it to beta [18:13:22] urbanecm: thanks for the link. Can I do the branching as part of a backport or do I need a dedicated window? [18:13:44] Jdlrobson: the branching is just uploading a patch to gerrit and letting a releng member to merge it [18:13:52] but it needs to be happening for at least 2 trains [18:14:11] um, it doesn't need to be branched to enable in beta [18:14:15] legoktm: it does [18:14:20] did that change? [18:14:54] yes (not sure when, but i can look it up). It's because there's only https://github.com/wikimedia/operations-mediawiki-config/blob/master/wmf-config/extension-list, but not an equivalent list for beta [18:15:07] oh, right [18:15:19] (and adding the extension there causes train to build i18n for it, which is why branching is needed) [18:15:24] yep yep [18:15:32] Okay I wrote the branch patch https://gerrit.wikimedia.org/r/c/mediawiki/tools/release/+/713667 [18:15:42] alternatively, you can just add the submodule to the existing wmf. 
branches [18:15:48] instead of having to wait 2 weeks [18:15:56] (03PS1) 10Bstorm: paws trove: add the log dir [puppet] - 10https://gerrit.wikimedia.org/r/713669 [18:16:16] legoktm: in theory, but i don't really want to do that -- i hope waiting 2 weeks is possible for Jdlrobson :-) [18:16:32] I can wait 2 weeks yeh, that's not a problem. So I just come back in 2 weeks with the same patch? :) [18:16:40] !log Successfully published image docker-registry.discovery.wmnet/nodejs12-devel:0.0.1, docker-registry.discovery.wmnet/nodejs12-slim:0.0.1 (T284346) [18:16:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:49] T284346: Provide a node 12 production image (based on bullseye?) - https://phabricator.wikimedia.org/T284346 [18:17:10] (03PS2) 10Jdlrobson: Enable NearbyPages on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713653 (https://phabricator.wikimedia.org/T246493) [18:17:23] Jdlrobson: great. Pretty much so (2 weeks after that branching patch gets merged). [18:17:26] (03PS3) 10Jdlrobson: Enable NearbyPages on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713653 (https://phabricator.wikimedia.org/T246493) [18:17:53] the patch also needs the extension to be added to wmf-config/extension-list -- you can do that now, or later, up to you [18:18:12] https://www.mediawiki.org/wiki/Writing_an_extension_for_deployment#Process is the docs for that [18:18:37] (03CR) 10Bstorm: [C: 03+2] paws trove: add the log dir [puppet] - 10https://gerrit.wikimedia.org/r/713669 (owner: 10Bstorm) [18:18:42] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [18:18:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:11] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[18:20:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:23] Jdlrobson: is there anything else I can help with today? [18:21:31] nope that's great urbanecm. Is Dan Duvall the right person for reviewing the extension list patch? Are there other reviewers I should add? [18:21:54] RECOVERY - SSH on bast5001.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:22:11] (03CR) 10Legoktm: [C: 03+2] sre.switchdc.mediawiki: Run the warmup cache script at least 6 times (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/707457 (https://phabricator.wikimedia.org/T285802) (owner: 10Legoktm) [18:24:20] Jdlrobson: for the last extension I deployed it was done by James Forrester -- not sure who usually handles that repo tbh [18:24:55] (03CR) 10Urbanecm: [C: 04-1] "The extension also needs to be in wmf-config/extension-list (the list of extensions to build i18n messages for)." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713653 (https://phabricator.wikimedia.org/T246493) (owner: 10Jdlrobson) [18:25:00] (03Merged) 10jenkins-bot: sre.switchdc.mediawiki: Run the warmup cache script at least 6 times [cookbooks] - 10https://gerrit.wikimedia.org/r/707457 (https://phabricator.wikimedia.org/T285802) (owner: 10Legoktm) [18:25:11] 10SRE, 10ops-codfw: mw2383 is misbehaving - https://phabricator.wikimedia.org/T286463 (10Papaul) @jijiki Please see below for the issue we found. The Power cap was enable setting the cap limits at 128watts below the recommended range of 213-355 watts) which was less then what both CPU's needed. I disable the P... 
[18:26:42] (03PS1) 10Bstorm: paws trove: actually install mariadb-client [puppet] - 10https://gerrit.wikimedia.org/r/713672 (https://phabricator.wikimedia.org/T267683) [18:29:53] (03CR) 10Bstorm: [C: 03+2] paws trove: actually install mariadb-client [puppet] - 10https://gerrit.wikimedia.org/r/713672 (https://phabricator.wikimedia.org/T267683) (owner: 10Bstorm) [18:35:19] (03PS1) 10Jgiannelos: maps: Allow creating ad-hoc python venvs for maps scripts [puppet] - 10https://gerrit.wikimedia.org/r/713674 [18:41:39] (03PS1) 10Herron: slo_template: add link to general documentation [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/713676 [18:42:23] (03CR) 10Herron: [V: 03+2 C: 03+2] slo_template: add link to general documentation [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/713676 (owner: 10Herron) [18:45:47] (03Abandoned) 10Daimona Eaytoy: Avoid passing invalid offset to mb_strpos [extensions/AbuseFilter] (wmf/1.37.0-wmf.12) - 10https://gerrit.wikimedia.org/r/703904 (https://phabricator.wikimedia.org/T285978) (owner: 10Daimona Eaytoy) [19:00:05] brennen and jeena: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for MediaWiki train - American Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210818T1900). 
[19:00:41] here and rolling forward shortly [19:06:00] (03PS1) 10Brennen Bearnes: group1 wikis to 1.37.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713684 [19:06:02] (03CR) 10Brennen Bearnes: [C: 03+2] group1 wikis to 1.37.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713684 (owner: 10Brennen Bearnes) [19:06:47] (03Merged) 10jenkins-bot: group1 wikis to 1.37.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713684 (owner: 10Brennen Bearnes) [19:08:41] !log brennen@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.37.0-wmf.19 [19:08:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:47] !log brennen@deploy1002 Synchronized php: group1 wikis to 1.37.0-wmf.19 (duration: 01m 05s) [19:09:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:27] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:12:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:00] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[19:14:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:15] 10SRE, 10Datacenter-Switchover: September 2021 Datacenter switchover (codfw -> eqiad) - https://phabricator.wikimedia.org/T287539 (10Krinkle) [19:38:10] !log ebernhardson@deploy1002 Started deploy [wikimedia/discovery/analytics@8d71e72]: configuration for imagerec data shipping [19:38:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:39:13] 10SRE, 10MW-on-K8s, 10serviceops: Make HTTP calls work within mediawiki on kubernetes - https://phabricator.wikimedia.org/T288848 (10TK-999) For the record, to resolve the same issue during our effort to upgrade Fandom's MW-on-k8s deployment, we ended up creating an HttpRequestFactory service override to dyn... [19:40:22] !log ebernhardson@deploy1002 Finished deploy [wikimedia/discovery/analytics@8d71e72]: configuration for imagerec data shipping (duration: 02m 12s) [19:40:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:43:42] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install cloudcephosd102[1-5].eqiad.wmnet - https://phabricator.wikimedia.org/T284471 (10Jclark-ctr) [19:44:18] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install cloudcephosd102[1-5].eqiad.wmnet - https://phabricator.wikimedia.org/T284471 (10Jclark-ctr) updated ticket showed 5 host for racking. but only 4 where ordered [19:44:29] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install cloudcephosd102[1-4].eqiad.wmnet - https://phabricator.wikimedia.org/T284471 (10Jclark-ctr) [19:45:00] 10SRE, 10MW-on-K8s, 10Release-Engineering-Team: Automated validation of mediawiki-multiversion images - https://phabricator.wikimedia.org/T288629 (10Legoktm) > A place to run the image which has access to production resources. The staging cluster is available for this. > A set of tests to run. Using httpbb... 
[19:58:07] (03CR) 10Jeena Huneidi: [C: 03+2] "Merging since it looks like all comments have been resolved" [deployment-charts] - 10https://gerrit.wikimedia.org/r/710704 (https://phabricator.wikimedia.org/T276405) (owner: 10BryanDavis) [19:58:36] (03CR) 10Jeena Huneidi: [C: 03+2] "Merging since it looks like all comments have been resolved" [deployment-charts] - 10https://gerrit.wikimedia.org/r/709565 (https://phabricator.wikimedia.org/T287716) (owner: 10BryanDavis) [20:00:05] brennen and jeena: I, the Bot under the Fountain, allow thee, The Deployer, to do MediaWiki train - American Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210818T1900). [20:00:05] chrisalbon and accraze: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Services – Graphoid / ORES . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210818T2000). [20:01:13] (03Merged) 10jenkins-bot: toolhub: initial chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/709565 (https://phabricator.wikimedia.org/T287716) (owner: 10BryanDavis) [20:01:23] (03Merged) 10jenkins-bot: toolhub: Add CronJob for crawler [deployment-charts] - 10https://gerrit.wikimedia.org/r/710704 (https://phabricator.wikimedia.org/T276405) (owner: 10BryanDavis) [20:20:58] PROBLEM - Ensure local MW versions match expected deployment on mw2383 is CRITICAL: CRITICAL: 526 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [20:28:59] 10SRE, 10 Data-Engineering, 10Analytics, 10Growth-Team, and 4 others: Migrated Server-side EventLogging events recording http.client_ip as 127.0.0.1 - https://phabricator.wikimedia.org/T288853 (10Milimetric) I'm just making sure I didn't miss anything: it looks to me like the instrumentation's just not sen... 
[21:02:55] (03PS1) 10Cwhite: aptrepo: add opensearch 1.x component [puppet] - 10https://gerrit.wikimedia.org/r/713701 (https://phabricator.wikimedia.org/T288618) [21:18:25] 10SRE, 10serviceops: Run httpbb periodically - https://phabricator.wikimedia.org/T289202 (10RLazarus) [21:18:36] 10SRE, 10serviceops: Run httpbb periodically - https://phabricator.wikimedia.org/T289202 (10RLazarus) p:05Triage→03Medium [21:20:59] (03PS1) 10Ahmon Dancy: Allow protocol for etcd server to be specified [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713704 [21:34:03] (03PS1) 10Herron: add SLI error and latency exceeded ratio panels [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/713705 [21:35:21] (03CR) 10Herron: [V: 03+2 C: 03+2] add SLI error and latency exceeded ratio panels [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/713705 (owner: 10Herron) [21:38:30] (03CR) 10Ahmon Dancy: [C: 04-1] "not right yet" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713704 (owner: 10Ahmon Dancy) [21:39:51] (03CR) 10Herron: [C: 03+1] aptrepo: add opensearch 1.x component [puppet] - 10https://gerrit.wikimedia.org/r/713701 (https://phabricator.wikimedia.org/T288618) (owner: 10Cwhite) [21:40:34] (03PS2) 10Ahmon Dancy: Allow protocol for etcd server to be specified [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713704 [21:41:10] (03CR) 10Ahmon Dancy: "Ready now" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/713704 (owner: 10Ahmon Dancy) [21:47:50] 10SRE, 10 Data-Engineering, 10Analytics, 10Growth-Team, and 4 others: Migrated Server-side EventLogging events recording http.client_ip as 127.0.0.1 - https://phabricator.wikimedia.org/T288853 (10Tgr) >>! In T288853#7292309, @Mholloway wrote: > Just to be clear, you're advocating here for getting `http.cli... 
[22:14:19] !log ebernhardson@deploy1002 Started deploy [wikimedia/discovery/analytics@26480d5]: fully enable imagerec data shipping [22:14:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:16:28] !log ebernhardson@deploy1002 Finished deploy [wikimedia/discovery/analytics@26480d5]: fully enable imagerec data shipping (duration: 02m 09s) [22:16:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:38:09] (03PS1) 10Cwhite: Add logstash-output-opensearch plugin [software/logstash/plugins] - 10https://gerrit.wikimedia.org/r/713713 [22:52:26] (03CR) 10H.krishna123: "recheck" [software/bernard] - 10https://gerrit.wikimedia.org/r/713604 (https://phabricator.wikimedia.org/T284404) (owner: 10H.krishna123) [22:54:11] (03CR) 10H.krishna123: "recheck" [software/bernard] - 10https://gerrit.wikimedia.org/r/713604 (https://phabricator.wikimedia.org/T284404) (owner: 10H.krishna123) [22:54:43] (03CR) 10H.krishna123: "I wonder if this tox config needs to be merged to master for Jenkins CI to pickup? The repo has been already added to Zuul" [software/bernard] - 10https://gerrit.wikimedia.org/r/713604 (https://phabricator.wikimedia.org/T284404) (owner: 10H.krishna123) [23:00:04] RoanKattouw, Niharika, and Urbanecm: (Dis)respected human, time to deploy Evening backport window (Your patch may or may not be deployed at the sole discretion of the deployer) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210818T2300). Please do the needful. [23:00:04] No GERRIT patches in the queue for this window AFAICS. [23:28:42] PROBLEM - SSH on bast5002 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [23:32:34] RECOVERY - SSH on bast5002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring