[00:54:28] PROBLEM - SSH on wdqs2002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:07:42] PROBLEM - SSH on dns5002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:55:14] RECOVERY - SSH on wdqs2002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:55:32] PROBLEM - SSH on cp5005.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:08:28] RECOVERY - SSH on dns5002.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:56:18] RECOVERY - SSH on cp5005.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:27:36] PROBLEM - SSH on bast5001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:28:37] (PS2) R4356thwiki: Disable indexing in NS_USER and NS_USER_TALK on bnwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/703884 (https://phabricator.wikimedia.org/T286152)
[04:36:05] (CR) R4356thwiki: "I added the comma anyway in case it was blocking this." [mediawiki-config] - https://gerrit.wikimedia.org/r/703884 (https://phabricator.wikimedia.org/T286152) (owner: R4356thwiki)
[07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210711T0700)
[07:34:40] PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 101 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[07:36:34] RECOVERY - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is OK: (C)100 gt (W)50 gt 22 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[09:00:51] SRE, SRE-Access-Requests: Requesting access to Wikimedia Analytics Data for Aisha Khatun - https://phabricator.wikimedia.org/T280967 (AKhatun_WMF) Resolved→Open
[09:02:41] SRE, SRE-Access-Requests: Requesting access to Wikimedia Analytics Data for Aisha Khatun - https://phabricator.wikimedia.org/T280967 (AKhatun_WMF) Hi @akosiaris, I had to fresh install OS and lost my ssh keys. Is it possible to change it so I can regain access? Should I put on a new public key here?
[09:15:15] SRE, SRE-Access-Requests: Requesting access to Wikimedia Analytics Data for Aisha Khatun - https://phabricator.wikimedia.org/T280967 (Peachey88) Open→Resolved @AKhatun_WMF The scope of this task has been completed, So i'm re-resolving it. Please create a new task tagged with #sre-access-requests...
[09:16:55] SRE, SRE-Access-Requests: Requesting access to Wikimedia Analytics Data for Aisha Khatun - https://phabricator.wikimedia.org/T280967 (Aklapper) Hi, see https://phabricator.wikimedia.org/project/profile/956/
[09:31:40] RECOVERY - SSH on bast5001.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:05:20] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: check_webrequest_partitions.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:58:10] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:05:48] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:05:52] SRE, SRE-Access-Requests: Requesting access to Wikimedia Analytics Data for Aisha Khatun - https://phabricator.wikimedia.org/T280967 (AKhatun_WMF) Thanks!
[11:27:49] SRE, SRE-Access-Requests: Requesting update to SSH key for Aisha Khatun - https://phabricator.wikimedia.org/T286410 (Peachey88)
[11:29:43] SRE, SRE-Access-Requests: Requesting update to SSH key for Aisha Khatun - https://phabricator.wikimedia.org/T286410 (AKhatun_WMF)
[11:39:20] (PS4) Amire80: Update autonyms for kea, ota, sjd in wmgExtraLanguageNames [mediawiki-config] - https://gerrit.wikimedia.org/r/699692 (https://phabricator.wikimedia.org/T284870)
[11:47:48] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:49:42] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[13:04:28] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[13:40:15] (PS3) H.krishna123: [WIP] api_db: Add code to enable database connection [software/bernard] - https://gerrit.wikimedia.org/r/702781 (https://phabricator.wikimedia.org/T285142)
[13:47:18] PROBLEM - SSH on mw1301.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[14:48:04] RECOVERY - SSH on mw1301.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[14:59:07] SRE, Maps, Product-Infrastructure-Team-Backlog, Traffic, Epic: Support maps serving for affiliate sites via an allow list - https://phabricator.wikimedia.org/T261694 (AntiCompositeNumber) I'm just trying to make sure that the requested configuration change would do what you expect it to do. I...
[15:01:05] (PS1) AntiCompositeNumber: VCL: Maps Referer block: allow wikimedia.it [puppet] - https://gerrit.wikimedia.org/r/703929 (https://phabricator.wikimedia.org/T261694)
[15:05:26] PROBLEM - Ensure traffic_exporter for the tls instance binds on port 9322 and responds to HTTP requests on cp3060 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[15:10:14] (CR) Urbanecm: [C: +1] "LGTM with note inline" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/703929 (https://phabricator.wikimedia.org/T261694) (owner: AntiCompositeNumber)
[15:16:04] RECOVERY - Ensure traffic_exporter for the tls instance binds on port 9322 and responds to HTTP requests on cp3060 is OK: HTTP OK: HTTP/1.0 200 OK - 23648 bytes in 0.325 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[15:57:02] (CR) Zabe: [C: +1] Disable indexing in NS_USER and NS_USER_TALK on bnwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/703884 (https://phabricator.wikimedia.org/T286152) (owner: R4356thwiki)
[17:06:22] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[17:22:04] PROBLEM - SSH on mw1284.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:29:02] SRE, Maps, Product-Infrastructure-Team-Backlog, Traffic, and 2 others: Support maps serving for affiliate sites via an allow list - https://phabricator.wikimedia.org/T261694 (Ferdi2005) Hi, I'm using the Wikimedia tiles for the Wiki Loves Monuments Italy app by Wikimedia Italy. The HTTP user-agen...
[18:07:25] SRE, Maps, Product-Infrastructure-Team-Backlog, Traffic, and 2 others: Support maps serving for affiliate sites via an allow list - https://phabricator.wikimedia.org/T261694 (Reedy) >>! In T261694#7203638, @Ferdi2005 wrote: > Hi, I'm using the Wikimedia tiles for the Wiki Loves Monuments Italy ap...
[18:09:43] SRE, Maps, Product-Infrastructure-Team-Backlog, Traffic, and 2 others: Support maps serving for affiliate sites via an allow list - https://phabricator.wikimedia.org/T261694 (Ferdi2005) @Reedy It seems descriptive enough anyway (and includes contact information, even if indirectly)
[18:12:34] SRE, Maps, Product-Infrastructure-Team-Backlog, Traffic, and 2 others: Support maps serving for affiliate sites via an allow list - https://phabricator.wikimedia.org/T261694 (Reedy) >>! In T261694#7203646, @Ferdi2005 wrote: > @Reedy It seems descriptive enough anyway (and includes contact informa...
[18:13:38] SRE, Maps, Product-Infrastructure-Team-Backlog, Traffic, and 2 others: Support maps serving for affiliate sites via an allow list - https://phabricator.wikimedia.org/T261694 (Reedy) Also, you're not even reading the description. >**If you are using the maps.wikimedia.org API for a Wikimedia affi...
[18:22:56] RECOVERY - SSH on mw1284.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:25:27] SRE, DBA, Datacenter-Switchover: switchdc should automatically downtime "Read only" checks on DB masters being switched - https://phabricator.wikimedia.org/T285803 (RLazarus) The downtime cookbook uses [[https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/production/modules/profile/fil...
[19:10:50] PROBLEM - SSH on cp5005.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:23:47] SRE, Maps, Product-Infrastructure-Team-Backlog, Traffic, and 2 others: Support maps serving for affiliate sites via an allow list - https://phabricator.wikimedia.org/T261694 (Ferdi2005) It's a mobile application, there isn't a domain. :smile: I forgot to provide a link to the example usage, which...
[19:33:26] PROBLEM - SSH on logstash2021.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:36:52] PROBLEM - SSH on mw1295.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:49:33] SRE, Maps, Product-Infrastructure-Team-Backlog, Traffic, and 2 others: Support maps serving for affiliate sites via an allow list - https://phabricator.wikimedia.org/T261694 (Ferdi2005) Ok, the library has been updated and I've set a more compliant user-agent. Now the app user-agent is: "Wiki Lo...
[20:11:36] RECOVERY - SSH on cp5005.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:26:28] (CR) H.krishna123: "I've rewritten the SQL query so it's parameterised. See comment on Ticket T285142, I used the built in functions to rewrite queries in cru" [software/bernard] - https://gerrit.wikimedia.org/r/702781 (https://phabricator.wikimedia.org/T285142) (owner: H.krishna123)
[20:37:40] RECOVERY - SSH on mw1295.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:36:48] RECOVERY - SSH on logstash2021.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook