[00:03:57] ACKNOWLEDGEMENT - DNS on mw1448.mgmt is CRITICAL: DNS CRITICAL - expected 0.0.0.0 but got 10.65.1.26 daniel_zahn https://phabricator.wikimedia.org/T296041 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:06:23] !log lvs3005 - disabling puppet and stopping pybal (traffic will go to lvs3007)
[00:06:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:10:00] RECOVERY - At least one CPU core of an LVS is saturated- packet drops are likely on lvs3005 is OK: All metrics within thresholds. https://bit.ly/wmf-lvscpu https://grafana.wikimedia.org/d/000000377/host-overview?var-server=lvs3005&var-datasource=esams+prometheus/ops
[00:11:18] PROBLEM - PyBal backends health check on lvs3005 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal
[00:11:56] PROBLEM - BGP status on cr3-esams is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[00:12:20] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[00:13:00] PROBLEM - PyBal connections to etcd on lvs3005 is CRITICAL: CRITICAL: 0 connections established with conf1006.eqiad.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal
[00:13:00] ACKNOWLEDGEMENT - PyBal backends health check on lvs3005 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) daniel_zahn known https://wikitech.wikimedia.org/wiki/PyBal
[00:13:04] (03CR) 10Dzahn: "[ldap-corp1001:~] $ /usr/bin/ldapsearch -x "mail=mmartorana*" | grep -E 'employee|mail|manager'" [puppet] - 10https://gerrit.wikimedia.org/r/740278 (https://phabricator.wikimedia.org/T295789) (owner: 10Dzahn)
[00:13:10] PROBLEM - pybal on lvs3005 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal
[00:14:27] mutante: turning off my VPN and let me see
[00:14:57] Wfm now. Thanks.
[00:17:16] PROBLEM - At least one CPU core of an LVS is saturated- packet drops are likely on lvs3007 is CRITICAL: cpu={1,11,13,15,3,5,7,9} https://bit.ly/wmf-lvscpu https://grafana.wikimedia.org/d/000000377/host-overview?var-server=lvs3007&var-datasource=esams+prometheus/ops
[00:17:37] 10SRE, 10ops-eqiad, 10decommission-hardware, 10serviceops: decommission thumbor1004.eqiad.wmnet - https://phabricator.wikimedia.org/T285480 (10wiki_willy) a:03Cmjohnson
[00:17:58] 10SRE, 10ops-eqiad, 10decommission-hardware, 10serviceops: decommission thumbor1003.eqiad.wmnet - https://phabricator.wikimedia.org/T285479 (10wiki_willy) a:03Cmjohnson
[00:18:24] I assume that huge host of bad-looking graphs on lvs3007 is being looked into?
[00:18:46] urbanecm: cool, ty!
[00:19:12] Np mutante
[00:19:13] perryprog: yes, that's happening
[00:19:19] ✅
[00:19:34] 10SRE, 10ops-codfw, 10decommission-hardware, 10serviceops: decommission thumbor200[12].codfw.wmnet - https://phabricator.wikimedia.org/T273141 (10wiki_willy) a:03Papaul
[00:20:24] perryprog: as far as we can tell it stopped being slow for users; Europe was affected
[00:20:43] additional lvs alerts are being looked into
[00:21:26] I'm not in Europe so sounds good to me ;). Just making sure (and because I'm curious); better safe than something on fire and all that.
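(Editor's note: the [00:03:57] DNS acknowledgement above is a simple expected-vs-resolved comparison. A minimal sketch of that logic, under the assumption that the plugin compares the resolved record against a configured expected address; the function name and shape are mine, not the actual monitoring plugin:)

```python
def dns_check(resolved: str, expected: str) -> str:
    """Hypothetical sketch of the comparison behind
    'DNS CRITICAL - expected X but got Y' seen in the log above."""
    if resolved == expected:
        return "DNS OK"
    return f"DNS CRITICAL - expected {expected} but got {resolved}"

# Reproduces the message format from the mw1448.mgmt alert:
print(dns_check("10.65.1.26", "0.0.0.0"))
# → DNS CRITICAL - expected 0.0.0.0 but got 10.65.1.26
```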
[00:21:48] yep, *nod*, ty
[00:23:59] !log cdanis@cumin1001 START - Cookbook sre.network.cf
[00:24:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:24:02] !log cdanis@cumin1001 END (PASS) - Cookbook sre.network.cf (exit_code=0)
[00:24:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:25:02] ACKNOWLEDGEMENT - PyBal connections to etcd on lvs3005 is CRITICAL: CRITICAL: 0 connections established with conf1006.eqiad.wmnet:4001 (min=12) daniel_zahn known https://wikitech.wikimedia.org/wiki/PyBal
[00:25:02] ACKNOWLEDGEMENT - pybal on lvs3005 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal daniel_zahn known https://wikitech.wikimedia.org/wiki/PyBal
[00:25:37] !log legoktm@cumin1001 START - Cookbook sre.network.cf
[00:25:38] !log legoktm@cumin1001 END (PASS) - Cookbook sre.network.cf (exit_code=0)
[00:25:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:25:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:25:49] !log lvs3005 - re-enabling puppet + pybal
[00:25:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:26:30] RECOVERY - PyBal backends health check on lvs3005 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[00:26:59] !log cdanis@cumin1001 START - Cookbook sre.network.cf
[00:27:00] !log cdanis@cumin1001 END (PASS) - Cookbook sre.network.cf (exit_code=0)
[00:27:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:27:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:27:06] RECOVERY - BGP status on cr3-esams is OK: BGP OK - up: 14, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[00:27:30] RECOVERY - BGP status on cr2-esams is OK: BGP OK - up: 441, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[00:27:41] (KubernetesCalicoDown) firing: (2) kubestage1003.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org
[00:28:04] RECOVERY - At least one CPU core of an LVS is saturated- packet drops are likely on lvs3007 is OK: All metrics within thresholds. https://bit.ly/wmf-lvscpu https://grafana.wikimedia.org/d/000000377/host-overview?var-server=lvs3007&var-datasource=esams+prometheus/ops
[00:28:22] RECOVERY - pybal on lvs3005 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal
[00:31:30] RECOVERY - PyBal connections to etcd on lvs3005 is OK: OK: 12 connections established with conf1006.eqiad.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal
[00:48:29] (03CR) 10Cwhite: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/732438 (https://phabricator.wikimedia.org/T288618) (owner: 10Cwhite)
[00:49:04] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=sidekiq site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:50:36] 10SRE, 10LDAP-Access-Requests, 10Security-Team, 10SecTeam-Processed: Add Manfredi Martorana to wmf ldap group - https://phabricator.wikimedia.org/T295789 (10Dzahn) 05Open→03Resolved p:05Triage→03High a:03Dzahn @mmartorana Welcome to WMF! You have been added to the wmf LDAP group. Take a look at...
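(Editor's note: the two lvs3005 checks that just recovered — the pybal PROCS check and the "PyBal connections to etcd" check — share the same count-vs-minimum pattern: 0 processes against an implicit minimum of 1, and 0 etcd connections against min=12. A minimal sketch of that shared logic; the helper name and output shape are assumptions, not the real Icinga plugins:)

```python
def threshold_check(label: str, count: int, minimum: int) -> str:
    """Generic count-vs-minimum check in the spirit of the PROCS and
    etcd-connections alerts above (hypothetical helper, not the actual
    monitoring plugin code)."""
    status = "OK" if count >= minimum else "CRITICAL"
    return f"{status}: {count} {label} (min={minimum})"

# Mirrors the etcd-connections alert and its recovery:
print(threshold_check("connections established", 0, 12))
# → CRITICAL: 0 connections established (min=12)
print(threshold_check("connections established", 12, 12))
# → OK: 12 connections established (min=12)
```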
[00:50:47] (03CR) 10Cwhite: [C: 03+2] add stack.head field for aggregating events by stack head [software/ecs] - 10https://gerrit.wikimedia.org/r/734698 (https://phabricator.wikimedia.org/T288851) (owner: 10Cwhite)
[00:51:19] (03Merged) 10jenkins-bot: add stack.head field for aggregating events by stack head [software/ecs] - 10https://gerrit.wikimedia.org/r/734698 (https://phabricator.wikimedia.org/T288851) (owner: 10Cwhite)
[00:53:26] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:58:08] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:01:14] RECOVERY - mailman list info on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 27 Dec 2021 09:00:28 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:02:30] !log lists1001 - restarted apache, icinga alerts for the web UI, but recovered
[01:02:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:06:59] 10SRE, 10SRE-Access-Requests, 10Security-Team, 10SecTeam-Processed: Add Manfredi Martorana to deployment and analytics-privatedata-users groups - https://phabricator.wikimedia.org/T295790 (10Dzahn) a:03mmartorana
[01:07:30] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for DAbad - https://phabricator.wikimedia.org/T293253 (10Dzahn) a:05Jelto→03DAbad
[01:08:40] 10SRE, 10SRE-Access-Requests, 10Wikibase Release Strategy, 10Wikidata, 10wdwb-tech: Requesting access to releasers-wikibase for rosalie-WMDE - https://phabricator.wikimedia.org/T295765 (10Dzahn) a:03Rosalie_WMDE
[01:10:22] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for Daimona - https://phabricator.wikimedia.org/T295993 (10Dzahn) a:03Daimona
[01:17:24] mutante: thank you
[01:57:32] legoktm: np, not 100% sure if it recovered right before that, or because of that
[02:10:44] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q2:(Need By: TBD) rack/setup/install mc20[38-55] - https://phabricator.wikimedia.org/T294962 (10Papaul) 12 out of 18 hosts on order T291998 were shipped today. We will be receiving those servers soon.
[02:36:19] (03CR) 10Juan90264: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740186 (https://phabricator.wikimedia.org/T296073) (owner: 104nn1l2)
[02:52:16] (CirrusSearchJVMGCOldPoolFlatlined) firing: Elasticsearch instance elastic2044-production-search-psi-codfw is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://alerts.wikimedia.org
[04:12:32] 10SRE, 10LDAP-Access-Requests, 10Security-Team, 10SecTeam-Processed: Add Manfredi Martorana to wmf ldap group - https://phabricator.wikimedia.org/T295789 (10Dzahn) added to [[ https://phabricator.wikimedia.org/project/members/61/ | Phabricator WMF-NDA ]] @mmartorana ^ This means access to non-public ticke...
[04:27:41] (KubernetesCalicoDown) firing: (2) kubestage1003.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org
[04:39:04] PROBLEM - MediaWiki exceptions and fatals per minute for jobrunner on alert1001 is CRITICAL: 111 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[04:41:14] RECOVERY - MediaWiki exceptions and fatals per minute for jobrunner on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[05:40:42] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: database-backups-snapshots.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:20:00] PROBLEM - Host logstash2028.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[06:45:16] RECOVERY - Host logstash2028.mgmt is UP: PING OK - Packet loss = 0%, RTA = 45.00 ms
[06:52:16] (CirrusSearchJVMGCOldPoolFlatlined) firing: Elasticsearch instance elastic2044-production-search-psi-codfw is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://alerts.wikimedia.org
[06:56:20] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event_sanitized_analytics_delayed.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:00:14] PROBLEM - Host logstash2028.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[07:06:26] RECOVERY - Host logstash2028.mgmt is UP: PING OK - Packet loss = 0%, RTA = 35.69 ms
[07:21:22] PROBLEM - Host logstash2028.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[08:27:42] (KubernetesCalicoDown) firing: (2) kubestage1003.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org
[08:49:02] RECOVERY - Host logstash2028.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.74 ms
[09:17:30] 10Puppet, 10Beta-Cluster-Infrastructure, 10Infrastructure-Foundations, 10Release-Engineering-Team, 10Scap: Fatal error: Uncaught ConfigException: Failed to load configuration from etcd - https://phabricator.wikimedia.org/T296125 (10RhinosF1)
[09:22:24] 10Puppet, 10Beta-Cluster-Infrastructure, 10Infrastructure-Foundations, 10Release-Engineering-Team, 10Scap: Fatal error: Uncaught ConfigException: Failed to load configuration from etcd - https://phabricator.wikimedia.org/T296125 (10Urbanecm) `Caused by: org.postgresql.util.PSQLException: SSL error: PKIX...
[09:25:19] 10Puppet, 10Beta-Cluster-Infrastructure, 10Infrastructure-Foundations, 10Release-Engineering-Team, 10Scap: Fatal error: Uncaught ConfigException: Failed to load configuration from etcd - https://phabricator.wikimedia.org/T296125 (10Urbanecm) ` Nov 20 07:16:02 deployment-puppetdb03 puppet-agent[18152]: Lo...
[09:27:02] 10Puppet, 10Beta-Cluster-Infrastructure, 10Infrastructure-Foundations, 10Release-Engineering-Team, 10Scap: Fatal error: Uncaught ConfigException: Failed to load configuration from etcd - https://phabricator.wikimedia.org/T296125 (10thcipriani) The puppet failure on the second deployment host is: ` thcip...
[09:43:11] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/740187 (owner: 10Muehlenhoff)
[10:52:16] (CirrusSearchJVMGCOldPoolFlatlined) firing: Elasticsearch instance elastic2044-production-search-psi-codfw is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://alerts.wikimedia.org
[12:27:42] (KubernetesCalicoDown) firing: (2) kubestage1003.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org
[12:36:18] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:49:35] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for Daimona - https://phabricator.wikimedia.org/T295993 (10Daimona) >>! In T295993#7517360, @Dzahn wrote: > Is there a specific thing that is not currently working? No, and > Or is this mostly just about a transition from volunteer to employee? yes, just ab...
[14:52:16] (CirrusSearchJVMGCOldPoolFlatlined) firing: Elasticsearch instance elastic2044-production-search-psi-codfw is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://alerts.wikimedia.org
[15:17:01] (CirrusSearchJVMGCOldPoolFlatlined) firing: (2) Elasticsearch instance elastic2044-production-search-codfw is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://alerts.wikimedia.org
[16:15:27] 10Puppet, 10Beta-Cluster-Infrastructure, 10Infrastructure-Foundations, 10Release-Engineering-Team, 10Scap: Fatal error: Uncaught ConfigException: Failed to load configuration from etcd - https://phabricator.wikimedia.org/T296125 (10LucasWerkmeister) Web requests to the Beta cluster (e.g. https://en.wikip...
[16:23:10] (03PS4) 10Thiemo Kreuz (WMDE): Streamline/modernize code in MWConfigCacheGenerator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737857
[16:27:41] (KubernetesCalicoDown) firing: (2) kubestage1003.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org
[16:32:34] (03PS1) 10Thiemo Kreuz (WMDE): Make use of the ?? operator in more trivial situations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740304
[16:35:29] (03PS1) 10Thiemo Kreuz (WMDE): Make use of the ?? operator in some more situations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740305
[16:36:31] (03PS7) 10Thiemo Kreuz (WMDE): Use more compact PHP7 syntax where possible [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737859
[16:38:36] (03CR) 10Thiemo Kreuz (WMDE): Use more compact PHP7 syntax where possible (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737859 (owner: 10Thiemo Kreuz (WMDE))
[16:40:20] (03PS1) 10Majavah: opentack: add keystone auth to remaining proxy api users [puppet] - 10https://gerrit.wikimedia.org/r/740306 (https://phabricator.wikimedia.org/T295234)
[16:41:05] (03CR) 10jerkins-bot: [V: 04-1] opentack: add keystone auth to remaining proxy api users [puppet] - 10https://gerrit.wikimedia.org/r/740306 (https://phabricator.wikimedia.org/T295234) (owner: 10Majavah)
[16:45:45] (03PS2) 10Majavah: opentack: add keystone auth to remaining proxy api users [puppet] - 10https://gerrit.wikimedia.org/r/740306 (https://phabricator.wikimedia.org/T295234)
[16:46:27] (03CR) 10jerkins-bot: [V: 04-1] opentack: add keystone auth to remaining proxy api users [puppet] - 10https://gerrit.wikimedia.org/r/740306 (https://phabricator.wikimedia.org/T295234) (owner: 10Majavah)
[16:48:38] (03PS3) 10Majavah: opentack: add keystone auth to remaining proxy api users [puppet] - 10https://gerrit.wikimedia.org/r/740306 (https://phabricator.wikimedia.org/T295234)
[16:49:20] (03CR) 10jerkins-bot: [V: 04-1] opentack: add keystone auth to remaining proxy api users [puppet] - 10https://gerrit.wikimedia.org/r/740306 (https://phabricator.wikimedia.org/T295234) (owner: 10Majavah)
[16:49:43] (03CR) 10Hoo man: [C: 03+1] Make use of the ?? operator in more trivial situations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740304 (owner: 10Thiemo Kreuz (WMDE))
[16:51:51] (03PS4) 10Majavah: opentack: add keystone auth to remaining proxy api users [puppet] - 10https://gerrit.wikimedia.org/r/740306 (https://phabricator.wikimedia.org/T295234)
[17:09:55] (03PS1) 10Majavah: encapi: Remove statsd metrics [puppet] - 10https://gerrit.wikimedia.org/r/740307
[17:10:30] (03CR) 10jerkins-bot: [V: 04-1] encapi: Remove statsd metrics [puppet] - 10https://gerrit.wikimedia.org/r/740307 (owner: 10Majavah)
[17:11:17] (03PS2) 10Majavah: encapi: Remove statsd metrics [puppet] - 10https://gerrit.wikimedia.org/r/740307
[17:12:12] (03CR) 10jerkins-bot: [V: 04-1] encapi: Remove statsd metrics [puppet] - 10https://gerrit.wikimedia.org/r/740307 (owner: 10Majavah)
[19:09:22] The beta wikis seem to be down for a bit
[19:09:41] en.wikisource.beta.wmflabs.org (at least)
[19:17:16] (CirrusSearchJVMGCOldPoolFlatlined) firing: (2) Elasticsearch instance elastic2044-production-search-codfw is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://alerts.wikimedia.org
[19:39:38] Sohom_Datta: not an SRE issue
[19:39:41] But we're aware
[19:39:55] https://phabricator.wikimedia.org/T296127 tracks
[19:40:08] Tbh it's beta so it's no one's issue
[19:40:56] You can blame elukey
[20:12:30] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /robots.txt (Untitled test) timed out before a response was received: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[20:14:28] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[20:27:42] (KubernetesCalicoDown) firing: (2) kubestage1003.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org
[20:55:30] PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test
[20:59:46] (03PS1) 10AOkoth: hieradata: add kubestage bgp peers [puppet] - 10https://gerrit.wikimedia.org/r/740314 (https://phabricator.wikimedia.org/T293729)
[21:00:42] (03CR) 10AOkoth: [C: 03+2] hieradata: add kubestage bgp peers [puppet] - 10https://gerrit.wikimedia.org/r/740314 (https://phabricator.wikimedia.org/T293729) (owner: 10AOkoth)
[21:05:52] PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test
[21:10:45] 10SRE, 10Platform Engineering, 10Traffic, 10Wikimedia-production-error: Wikimedia\Assert\PostconditionException: Postcondition failed: makeTitleSafe() should always return a Title for the text returned by getRootText(). - https://phabricator.wikimedia.org/T290194 (10Zabe)
[21:10:54] 10SRE, 10Platform Engineering, 10Traffic, 10Wikimedia-production-error: Wikimedia\Assert\PostconditionException: Postcondition failed: makeTitleSafe() should always return a Title for the text returned by getRootText(). - https://phabricator.wikimedia.org/T290194 (10Zabe) >>! In T290194#7451977, @Umherirre...
[21:16:18] PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test
[21:16:49] Is that useful every 10 minutes?
[21:17:27] (KubernetesCalicoDown) resolved: (2) kubestage1003.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org
[21:26:42] PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test
[21:44:12] PROBLEM - SSH on contint1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:49:33] 10SRE, 10Platform Engineering, 10Traffic, 10Wikimedia-production-error: Wikimedia\Assert\PostconditionException: Postcondition failed: makeTitleSafe() should always return a Title for the text returned by getRootText(). - https://phabricator.wikimedia.org/T290194 (10Umherirrender) >>! In T290194#7518495, @...
[21:53:44] PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test
[22:10:50] PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test
[22:21:46] PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test
[22:30:32] RECOVERY - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is OK: 1 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test
[22:32:13] 10SRE, 10Platform Engineering, 10Traffic, 10Patch-For-Review, 10Wikimedia-production-error: Wikimedia\Assert\PostconditionException: Postcondition failed: makeTitleSafe() should always return a Title for the text returned by getRootText(). - https://phabricator.wikimedia.org/T290194 (10Umherirrender) The...
[23:17:16] (CirrusSearchJVMGCOldPoolFlatlined) firing: (2) Elasticsearch instance elastic2044-production-search-codfw is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://alerts.wikimedia.org