[00:02:37] PROBLEM - Maps tiles generation on alert1001 is CRITICAL: CRITICAL: 100.00% of data under the critical threshold [5.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/d/000000305/maps-performances?orgId=1&viewPanel=8 [00:03:55] PROBLEM - PHP7 jobrunner on mw1437 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [00:10:37] RECOVERY - PHP7 jobrunner on mw1437 is OK: HTTP OK: HTTP/1.1 200 OK - 323 bytes in 0.004 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [00:15:25] PROBLEM - PHP7 jobrunner on mw1437 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [00:17:39] RECOVERY - PHP7 jobrunner on mw1437 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 4.814 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [00:20:53] RECOVERY - SSH on furud.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [00:25:01] (CirrusSearchHighOldGCFrequency) resolved: Elasticsearch instance elastic1049-production-search-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [00:28:18] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:33:18] (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:39:13] (KubernetesRsyslogDown) firing: rsyslog on ml-serve-ctrl1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [00:48:07] PROBLEM - SSH on labweb1002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:30:03] PROBLEM - SSH on wtp1048.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:40:45] (JobUnavailable) firing: (5) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:49:47] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:50:45] (JobUnavailable) firing: (5) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:57:45] (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [02:31:15] RECOVERY - SSH on wtp1048.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:37:56] (NodeTextfileStale) firing: (3) Stale textfile 
for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [02:43:16] (CirrusSearchJVMGCOldPoolFlatlined) firing: Elasticsearch instance elastic1049-production-search-psi-eqiad is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCOldPoolFlatlined [02:50:31] RECOVERY - SSH on labweb1002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:01:56] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [03:42:02] (03PS1) 10Ladsgroup: Stop writing to rev_actor_temp table in group1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790020 (https://phabricator.wikimedia.org/T275246) [03:44:05] (03PS1) 10Ladsgroup: Set arwiki to read new in templatelinks migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790021 (https://phabricator.wikimedia.org/T306673) [03:46:55] (NodeTextfileStale) firing: Stale textfile for cloudvirt1019:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [03:55:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - 
https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [04:04:03] PROBLEM - SSH on wtp1037.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [04:17:51] (03PS1) 10Andrew Bogott: cinder-backup: apply a hack to prevent repeated backup failures [puppet] - 10https://gerrit.wikimedia.org/r/790023 [04:18:29] (03CR) 10jenkins-bot: [V: 04-1] cinder-backup: apply a hack to prevent repeated backup failures [puppet] - 10https://gerrit.wikimedia.org/r/790023 (owner: 10Andrew Bogott) [04:21:16] (03PS2) 10Andrew Bogott: cinder-backup: apply a hack to prevent repeated backup failures [puppet] - 10https://gerrit.wikimedia.org/r/790023 [04:23:16] (03CR) 10Andrew Bogott: cinder-backup: apply a hack to prevent repeated backup failures (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/790023 (owner: 10Andrew Bogott) [04:25:24] (03CR) 10Ladsgroup: [C: 03+2] Set arwiki to read new in templatelinks migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790021 (https://phabricator.wikimedia.org/T306673) (owner: 10Ladsgroup) [04:26:10] (03Merged) 10jenkins-bot: Set arwiki to read new in templatelinks migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790021 (https://phabricator.wikimedia.org/T306673) (owner: 10Ladsgroup) [04:27:03] (03PS2) 10Ladsgroup: Stop writing to rev_actor_temp table in group1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790020 (https://phabricator.wikimedia.org/T275246) [04:27:08] (03CR) 10Ladsgroup: [C: 03+2] Stop writing to rev_actor_temp table in group1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790020 (https://phabricator.wikimedia.org/T275246) (owner: 10Ladsgroup) [04:27:52] (03Merged) 10jenkins-bot: Stop writing to rev_actor_temp table in group1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790020 (https://phabricator.wikimedia.org/T275246) (owner: 10Ladsgroup) [04:31:47] !log ladsgroup@deploy1002 
Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:790021|Set arwiki to read new in templatelinks migration (T306673)]] (duration: 05m 10s) [04:31:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:31:52] T306673: Turn on read new for templatelinks on beta and production - https://phabricator.wikimedia.org/T306673 [04:34:28] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [04:34:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:35:29] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 260 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [04:37:06] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [04:37:07] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [04:37:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:37:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:38:02] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [04:38:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:39:13] (KubernetesRsyslogDown) firing: rsyslog on ml-serve-ctrl1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [04:40:02] !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:790020|Stop writing to rev_actor_temp table in group1 (T275246)]] (duration: 05m 06s) [04:40:05] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [04:40:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:40:07] T275246: Populate rev_actor and rev_comment_id - https://phabricator.wikimedia.org/T275246 [04:42:47] that was me [04:44:41] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 273 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [04:47:12] going to revert it now [04:47:37] !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: (no justification provided) (duration: 05m 04s) [04:47:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:48:08] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [04:48:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:49:06] (03CR) 10ArielGlenn: [C: 03+1] "This looks fine, if you don't have merge rights I can do so whenever you like, just let me know." 
[puppet] - 10https://gerrit.wikimedia.org/r/789794 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [04:49:17] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 6 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [04:52:31] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [04:52:32] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [04:52:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:52:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:53:01] !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: Revert: [[gerrit:790021|Set arwiki to read new in templatelinks migration (T306673)]] (duration: 05m 03s) [04:53:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:53:05] T306673: Turn on read new for templatelinks on beta and production - https://phabricator.wikimedia.org/T306673 [04:53:29] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [04:54:40] mw1415.eqiad.wmnet is not working, the scap gets stuck all the time [04:56:05] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [04:56:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:01:49] (03PS1) 10Ladsgroup: Revert "Set arwiki to read new in templatelinks migration" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/789957 [05:01:56] (03PS2) 10Ladsgroup: Revert "Set arwiki to read new in templatelinks migration" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/789957 [05:02:01] (03CR) 10Ladsgroup: [C: 03+2] Revert "Set arwiki to 
read new in templatelinks migration" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/789957 (owner: 10Ladsgroup) [05:02:51] (03Merged) 10jenkins-bot: Revert "Set arwiki to read new in templatelinks migration" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/789957 (owner: 10Ladsgroup) [05:05:13] RECOVERY - SSH on wtp1037.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:06:13] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [05:06:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:10:16] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [05:10:17] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [05:10:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:10:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:11:12] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [05:11:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:12:32] (03PS1) 10Marostegui: db1172: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/790158 (https://phabricator.wikimedia.org/T307546) [05:14:03] (03CR) 10Marostegui: [C: 03+2] db1172: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/790158 (https://phabricator.wikimedia.org/T307546) (owner: 10Marostegui) [05:14:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1172', diff saved to https://phabricator.wikimedia.org/P27761 and previous config saved to /var/cache/conftool/dbconfig/20220509-051426-marostegui.json [05:14:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:31:14] (03Abandoned) 10KartikMistry: ULS entrypoint: Do not show current language, fix domain redirects 
[extensions/ContentTranslation] (wmf/1.39.0-wmf.10) - 10https://gerrit.wikimedia.org/r/789832 (https://phabricator.wikimedia.org/T307745) (owner: 10KartikMistry) [05:48:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1172 with minimal weight to test 10.6 T307546', diff saved to https://phabricator.wikimedia.org/P27762 and previous config saved to /var/cache/conftool/dbconfig/20220509-054823-marostegui.json [05:48:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:48:29] T307546: Migrate a wikidata DB to MariaDB 10.6 - https://phabricator.wikimedia.org/T307546 [05:51:00] (JobUnavailable) firing: (4) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:57:45] (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [05:59:34] (03Restored) 10KartikMistry: ULS entrypoint: Do not show current language, fix domain redirects [extensions/ContentTranslation] (wmf/1.39.0-wmf.10) - 10https://gerrit.wikimedia.org/r/789832 (https://phabricator.wikimedia.org/T307745) (owner: 10KartikMistry) [06:01:21] PROBLEM - PHP7 rendering on mw1439 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [06:02:31] RECOVERY - PHP7 rendering on mw1439 is OK: HTTP OK: HTTP/1.1 200 OK - 325 bytes in 0.376 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [06:06:49] PROBLEM - PHP7 rendering on mw1439 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [06:08:23] 
RECOVERY - PHP7 rendering on mw1439 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 6.146 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [06:21:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1004:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [06:23:43] !log start of updateRestrictions.php on s5 (T218446) [06:23:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:23:50] T218446: Remove use of legacy page.page_restrictions field - https://phabricator.wikimedia.org/T218446 [06:24:35] (03CR) 10Giuseppe Lavagetto: mediawiki 0.2.0: Add mw.localmemcached.enabled value (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/764919 (owner: 10Ahmon Dancy) [06:31:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1004:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [06:37:56] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [06:38:04] (03PS3) 10Slyngshede: Convert dumps-status from cron to systemd timer. 
[puppet] - 10https://gerrit.wikimedia.org/r/789769 (https://phabricator.wikimedia.org/T273673) [06:38:38] (03CR) 10jenkins-bot: [V: 04-1] Convert dumps-status from cron to systemd timer. [puppet] - 10https://gerrit.wikimedia.org/r/789769 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [06:43:16] (CirrusSearchJVMGCOldPoolFlatlined) firing: Elasticsearch instance elastic1049-production-search-psi-eqiad is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCOldPoolFlatlined [06:48:15] PROBLEM - SSH on wtp1025.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:48:49] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:50:31] (03CR) 10WhitePhosphorus: Set wmgTimelineDefaultFont to unifont for yue, wuu and zh_classical (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790016 (https://phabricator.wikimedia.org/T188997) (owner: 10Stang) [06:52:00] (03PS4) 10Slyngshede: Convert dumps-status from cron to systemd timer. [puppet] - 10https://gerrit.wikimedia.org/r/789769 (https://phabricator.wikimedia.org/T273673) [06:57:53] PROBLEM - PHP7 rendering on mw1439 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [06:59:14] (03PS3) 10Stang: Set wmgTimelineDefaultFont to unifont for yue, wuu and zh_classical [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790016 (https://phabricator.wikimedia.org/T188997) [07:00:05] Amir1, awight, Urbanecm, and taavi: #bothumor I � Unicode. All rise for UTC morning backport window deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220509T0700). [07:00:05] kart_ and koi: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:01:44] 10SRE, 10serviceops: Provide node14 and node16 images for running production node-based services - https://phabricator.wikimedia.org/T306996 (10Joe) a:03Joe [07:01:56] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [07:01:59] (03PS4) 10Stang: Set wmgTimelineDefaultFont to unifont for cdo, gan, hak, wuu, yue and zh_classical [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790016 (https://phabricator.wikimedia.org/T188997) [07:02:29] (03CR) 10Stang: Set wmgTimelineDefaultFont to unifont for cdo, gan, hak, wuu, yue and zh_classical (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790016 (https://phabricator.wikimedia.org/T188997) (owner: 10Stang) [07:03:01] Sorry, still figuring out to update my cherry-pick :/ [07:03:17] koi: you can go ahead with your patch. 
[07:03:35] ok, doing a rebase [07:04:30] (03PS5) 10Stang: Set wmgTimelineDefaultFont to unifont for cdo, gan, hak, wuu, yue and zh_classical [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790016 (https://phabricator.wikimedia.org/T188997) [07:04:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1172 with minimal weight to test 10.6 T307546', diff saved to https://phabricator.wikimedia.org/P27763 and previous config saved to /var/cache/conftool/dbconfig/20220509-070430-marostegui.json [07:04:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:04:36] T307546: Migrate a wikidata DB to MariaDB 10.6 - https://phabricator.wikimedia.org/T307546 [07:06:25] RECOVERY - PHP7 rendering on mw1439 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.026 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [07:06:48] sorry please wait for a while, found something missing... [07:06:55] PROBLEM - PHP7 jobrunner on mw1439 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [07:07:07] (03Abandoned) 10Muehlenhoff: ganeti.addnode: Fix up bridge detection for Bullseye changes [cookbooks] - 10https://gerrit.wikimedia.org/r/786828 (owner: 10Muehlenhoff) [07:09:03] RECOVERY - PHP7 jobrunner on mw1439 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 8.141 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [07:09:28] kart_, you could do it first 0 0 [07:09:31] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good (but strictly speaking needs a +1 by Kwaku or someone else listed unter "approval")." [puppet] - 10https://gerrit.wikimedia.org/r/789727 (owner: 10BCornwall) [07:10:45] (03CR) 10ArielGlenn: [C: 03+1] "Looks great, merge at your convenience." 
[puppet] - 10https://gerrit.wikimedia.org/r/789769 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [07:11:01] (03PS1) 10Hashar: [WMF] update Zuul plugin [software/gerrit] (wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/790261 (https://phabricator.wikimedia.org/T307621) [07:13:13] koi: no. my patch isn't ready. [07:13:37] PROBLEM - PHP7 jobrunner on mw1439 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [07:13:58] (03CR) 10Stang: "For reference: If48c1338304a2257d02e095b4faefc6e4af44e02, Ic03fbf6eb72b4ba1f9c3f2574faba44e470cf826" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790016 (https://phabricator.wikimedia.org/T188997) (owner: 10Stang) [07:15:47] RECOVERY - PHP7 jobrunner on mw1439 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 3.292 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [07:15:51] (03CR) 10Ayounsi: [C: 03+2] "Tested on netbox-dev:" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/789857 (owner: 10Ayounsi) [07:15:53] (03PS2) 10KartikMistry: ULS entrypoint: Do not show current language, fix domain redirects [extensions/ContentTranslation] (wmf/1.39.0-wmf.10) - 10https://gerrit.wikimedia.org/r/789832 (https://phabricator.wikimedia.org/T307745) [07:16:50] (03CR) 10Ayounsi: [V: 03+2 C: 03+2] provision_server: validate port number for non VC switches [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/789857 (owner: 10Ayounsi) [07:16:57] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
[07:17:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:17:23] (03Merged) 10jenkins-bot: provision_server: validate port number for non VC switches [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/789857 (owner: 10Ayounsi) [07:17:54] (03PS6) 10Stang: Fix display issue of Timeline in cdo, gan, hak, wuu, yue and zh_classical [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790016 (https://phabricator.wikimedia.org/T188997) [07:19:51] PROBLEM - PHP7 rendering on mw1439 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [07:20:15] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [07:20:17] Hi, anyone could deploy in this window? [07:20:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:20:33] PROBLEM - PHP7 jobrunner on mw1439 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [07:20:34] koi: my patch's CI will take time :/ [07:20:52] I'm ready :) [07:22:15] OK. Seems no one else to deploy today. Let me check your patch. [07:22:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [07:23:42] (03CR) 10KartikMistry: [C: 03+2] "UTC afternoon backport." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790016 (https://phabricator.wikimedia.org/T188997) (owner: 10Stang) [07:24:24] koi: I'll update once patch is ready to test. 
[07:24:32] (03Merged) 10jenkins-bot: Fix display issue of Timeline in cdo, gan, hak, wuu, yue and zh_classical [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790016 (https://phabricator.wikimedia.org/T188997) (owner: 10Stang) [07:24:54] oh thanks! [07:25:29] koi: Please test it on mwdebug1001 [07:25:41] looking [07:26:35] (03CR) 10Hashar: [C: 03+2] [WMF] update Zuul plugin [software/gerrit] (wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/790261 (https://phabricator.wikimedia.org/T307621) (owner: 10Hashar) [07:26:54] (03PS1) 10Hashar: Update Zuul plugin [software/gerrit] (deploy/wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/790262 (https://phabricator.wikimedia.org/T307621) [07:27:43] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [07:27:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:27:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [07:28:26] (03CR) 10jenkins-bot: [V: 04-1] Update Zuul plugin [software/gerrit] (deploy/wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/790262 (https://phabricator.wikimedia.org/T307621) (owner: 10Hashar) [07:28:55] (03CR) 10Hashar: "recheck" [software/gerrit] (deploy/wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/790262 (https://phabricator.wikimedia.org/T307621) (owner: 10Hashar) [07:29:17] koi: looks good? [07:29:39] still testing, it affects 7 sites... [07:29:53] oh right! 
[07:30:30] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [07:30:31] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [07:30:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:30:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:30:43] (03PS1) 10Giuseppe Lavagetto: Add node14, node16 images [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/790264 (https://phabricator.wikimedia.org/T306996) [07:31:13] kart_, test completed and everything looks great [07:31:21] RECOVERY - PHP7 rendering on mw1439 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 3.219 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [07:31:29] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [07:31:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:31:35] koi: awesome. Deploying.. 
[07:31:55] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] Add node14, node16 images [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/790264 (https://phabricator.wikimedia.org/T306996) (owner: 10Giuseppe Lavagetto) [07:32:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Increase traffic on db1172 to test 10.6 T307546', diff saved to https://phabricator.wikimedia.org/P27764 and previous config saved to /var/cache/conftool/dbconfig/20220509-073200-marostegui.json [07:32:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:32:05] T307546: Migrate a wikidata DB to MariaDB 10.6 - https://phabricator.wikimedia.org/T307546 [07:34:07] (03Merged) 10jenkins-bot: [WMF] update Zuul plugin [software/gerrit] (wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/790261 (https://phabricator.wikimedia.org/T307621) (owner: 10Hashar) [07:34:35] (03PS1) 10Muehlenhoff: sre.ganeti.reboot-vm: Add an option to skip waiting for successful Puppet run [cookbooks] - 10https://gerrit.wikimedia.org/r/790266 [07:34:59] (03CR) 10Hashar: "recheck due to: rsync: link_stat "/git-fat/2e6a23935370b5571e1a8f92ae9e736f97514142" (in archiva) failed: No such file or directory (2)" [software/gerrit] (deploy/wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/790262 (https://phabricator.wikimedia.org/T307621) (owner: 10Hashar) [07:35:18] `servicechecker.CheckError: Generic connection error: HTTPSConnectionPool(host='mw1415.eqiad.wmnet', port=443): Max retries exceeded with url: /spec.yaml (Caused by ConnectTimeoutError(, 'Connection to mw1415.eqiad.wmnet timed out. 
(connect timeout=5)'))` [07:35:26] While deploying ^^ [07:35:44] 0 0 [07:37:10] !log kartik@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:790016|Fix display issue of Timeline in cdo, gan, hak, wuu, yue and zh_classical (T188997)]] (duration: 05m 13s) [07:37:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:37:14] kart_: that's been around since I did it earlier the day, we probably need to remove this from scap but I don't know how [07:37:15] T188997: Change EasyTimeline's font on yue Wikipedia with support for Chinese characters - https://phabricator.wikimedia.org/T188997 [07:37:33] Amir1: Thanks for update! [07:37:34] I hope pybal already depooled it [07:37:58] (03CR) 10Hashar: [C: 03+2] Update Zuul plugin [software/gerrit] (deploy/wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/790262 (https://phabricator.wikimedia.org/T307621) (owner: 10Hashar) [07:38:17] thanks everyone :) [07:38:34] (03Merged) 10jenkins-bot: Update Zuul plugin [software/gerrit] (deploy/wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/790262 (https://phabricator.wikimedia.org/T307621) (owner: 10Hashar) [07:39:07] (03CR) 10jenkins-bot: [V: 04-1] sre.ganeti.reboot-vm: Add an option to skip waiting for successful Puppet run [cookbooks] - 10https://gerrit.wikimedia.org/r/790266 (owner: 10Muehlenhoff) [07:39:19] from SAL a couple days ago [07:39:22] 01:51 dzahn@cumin2002: conftool action : set/pooled=no; selector: dc=eqiad,name=mw1415.eqiad.wmnet [07:39:22] 01:50 dzahn@cumin2002: conftool action : set/pooled=no; selector: dc=codfw,name=mw1415.eqiad.wmnet [07:39:27] Amir1: I think you can unpool a mw server using `confctl` it still shows up at https://config-master.wikimedia.org/pybal/eqiad/apaches [07:39:35] { 'host': 'mw1415.eqiad.wmnet', 'weight':30, 'enabled': True } [07:39:36] they need to be set as pooled=inactive to be removed from scap too [07:40:05] hashar: how it's not spitting 500 left and right? 
[07:40:16] who knows [07:40:44] pybal health checking? [07:40:44] I can't ssh to it [07:41:13] RECOVERY - PHP7 jobrunner on mw1439 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 3.126 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [07:41:18] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:41:38] SRE, serviceops, Patch-For-Review: Provide node14 and node16 images for running production node-based services - https://phabricator.wikimedia.org/T306996 (Joe) I've build and published the `nodejs14-slim` and the `nodejs16-slim` images, using the nodejs package from the components. One importaant t... [07:41:46] SRE, serviceops: Migrate node-based services in production to node14 - https://phabricator.wikimedia.org/T306995 (Joe) [07:41:49] SRE, serviceops, Patch-For-Review: Provide node14 and node16 images for running production node-based services - https://phabricator.wikimedia.org/T306996 (Joe) Open→Resolved [07:41:56] I'm going to depool it, finding the docs is quite fun [07:42:06] at least mw1415 is still in scap dsh groups [07:42:18] Amir1: on a cumin host, `sudo confctl select host=mw1415.eqiad.wmnet set/pooled=inactive` [07:42:23] (don't ask how I know) [07:42:26] maybe there is a runbook that does everything? [07:44:01] else there are a few commands listed at https://wikitech.wikimedia.org/wiki/Conftool#Pooling/depooling_a_server_from_all_the_related_services such as `confctl depool --hostname=mw1415.eqiad.wmnet` [07:44:14] it started setting everything inactive [07:44:15] codfw/maps/kartotherian/maps2010.codfw.wmnet: pooled changed yes => inactive [07:44:30] oh, name not host maybe?
[07:44:45] change `set/pooled=inactive` to `get` to query the current status [07:44:46] now half of codfw is inactive [07:44:58] doesn't it ask for confirmation first? [07:45:23] it gave a massive thing, I thought it's just didn't diff [07:45:26] anyway [07:45:48] thankfully nothing in eqiad got depooled [07:46:06] Here is the full list [07:46:10] https://www.irccloud.com/pastebin/SFjYDPSD/ [07:46:18] (ProbeDown) firing: (27) Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:46:19] (ProbeDown) firing: (29) Service appservers-https:443 has failed probes (http_appservers-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:46:23] that's me [07:46:30] PROBLEM - Host thumbor.svc.codfw.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [07:46:55] (NodeTextfileStale) firing: Stale textfile for cloudvirt1019:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [07:46:57] ok how can I make this active? [07:47:29] (PS2) Muehlenhoff: sre.ganeti.reboot-vm: Add an option to skip waiting for successful Puppet run [cookbooks] - https://gerrit.wikimedia.org/r/790266 [07:47:32] <_joe_> Amir1: wtf havce you done? [07:47:45] <_joe_> what command you gave?
[07:47:54] (CR) jerkins-bot: [V: -1] sre.ganeti.reboot-vm: Add an option to skip waiting for successful Puppet run [cookbooks] - https://gerrit.wikimedia.org/r/790266 (owner: Muehlenhoff) [07:48:13] sudo confctl select host=mw1415.eqiad.wmnet set/pooled=inactive [07:48:13] PROBLEM - Confd template for /srv/config-master/pybal/codfw/search-https on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/search-https is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:48:13] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/ldap-ro on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/ldap-ro is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:48:13] PROBLEM - Confd template for /srv/config-master/pybal/codfw/kartotherian-ssl on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/kartotherian-ssl is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:48:13] PROBLEM - Confd template for /srv/config-master/pybal/esams/upload on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/esams/upload is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:48:17] PROBLEM - PyBal IPVS diff check on lvs2010 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.13:443, 208.80.153.232:80, 10.2.1.39:6443, 208.80.153.240:443, 10.2.1.54:443, 2620:0:860:ed1a::2:b:80, 208.80.153.225:443, 10.2.1.30:9643, 2620:0:860:ed1a::2:443, 2620:0:860:ed1a::2:80, 10.2.1.32:443, 2620:0:860:ed1a::2:b:443, 10.2.1.69:30443, 10.2.1.25:80, 10.2.1.27:80, 2620:0:860:ed1a::9:80, 10.2.1.32:80, 208.80.153.252 [07:48:17] 20:0:860:ed1a::3:fa:22, 10.2.1.44:443, 10.2.1.63:30443, 10.2.1.27:443, 10.2.1.30:9443, 10.2.1.1:443, 2620:0:860:ed1a::1:80, 10.2.1.26:443, 10.2.1.43:443, 208.80.153.224:443, 208.80.153.250:22, 208.80.153.240:80, 10.2.1.72:6443, 208.80.153.252:636, 208.80.153.224:80, 10.2.1.5:443,
208.80.153.232:443, 10.2.1.13:6533, 10.2.1.30:9200, 10.2.1.10:443, 2620:0:860:ed1a::1:443, 10.2.1.41:80, 10.2.1.32:8888, 10.2.1.24:8800, 10.2.1.30:9243, 10.2.1.5 [07:48:17] 620:0:860:ed1a::9:443, 208.80.153.225:80]) https://wikitech.wikimedia.org/wiki/PyBal [07:48:21] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/k8s-ingress-staging on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/k8s-ingress-staging is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:48:22] how can I bring it back [07:48:22] PROBLEM - LVS wdqs-ssl codfw port 443/tcp - Wikidata Query Service - HTTPS IPv4 #page on wdqs.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.32 and port 443: Connection refused https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [07:48:23] PROBLEM - LVS k8s-ingress-staging codfw port 30443/tcp - istio-ingresscontroller on kubernetes staging. k8s-ingress-staging.svc.codfw.wmnet IPv4 on k8s-ingress-staging.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.69 and port 30443: Connection refused https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [07:48:26] PROBLEM - LVS swift codfw port 80/tcp - Swift media storage IPv4 #page on ms-fe.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.27 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [07:48:26] Amir1: set/pooled=yes [07:48:27] PROBLEM - Confd template for /srv/config-master/pybal/codfw/wdqs-ssl on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/wdqs-ssl is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:48:29] (MjolnirUpdateFailureRateExceedesThreshold) firing: Data shipping to CirrusSearch in codfw is experiencing abnormal failure rates - https://grafana.wikimedia.org/d/000000591/elasticsearch-mjolnir-bulk-updates - https://alerts.wikimedia.org/?q=alertname%3DMjolnirUpdateFailureRateExceedesThreshold [07:48:31] PROBLEM - Maps HTTPS 
on maps2008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [07:48:31] PROBLEM - Confd template for /srv/config-master/pybal/ulsfo/upload-https on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/ulsfo/upload-https is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:48:31] PROBLEM - PyBal IPVS diff check on lvs2009 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.63:30443, 10.2.1.39:6443, 10.2.1.13:443, 10.2.1.30:9643, 10.2.1.32:443, 10.2.1.69:30443, 10.2.1.25:80, 10.2.1.32:80, 10.2.1.44:443, 10.2.1.27:443, 10.2.1.30:9443, 10.2.1.1:443, 10.2.1.26:443, 10.2.1.43:443, 10.2.1.54:443, 10.2.1.72:6443, 10.2.1.5:443, 10.2.1.13:6533, 10.2.1.30:9200, 10.2.1.10:443, 10.2.1.41:80, 10.2.1.32: [07:48:31] .2.1.24:8800, 10.2.1.30:9243, 10.2.1.53:443, 10.2.1.27:80]) https://wikitech.wikimedia.org/wiki/PyBal [07:48:33] PROBLEM - Confd template for /srv/config-master/pybal/ulsfo/ncredir-https on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/ulsfo/ncredir-https is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:48:33] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/text-https on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/text-https is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:48:33] PROBLEM - Confd template for /srv/config-master/pybal/eqsin/text on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqsin/text is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:48:33] want to [07:48:33] what on earth is going on ? 
[07:48:35] PROBLEM - Confd template for /srv/config-master/pybal/codfw/jobrunner on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/jobrunner is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:48:38] oh god [07:48:40] PROBLEM - LVS upload-https codfw port 443/tcp - Images and other media- upload.eqiad.wikimedia.org IPv4 #page on upload-lb.codfw.wikimedia.org is CRITICAL: connect to address 208.80.153.240 and port 443: Connection refused https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [07:48:41] <_joe_> no idea how to [07:48:41] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/kartotherian-ssl on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/kartotherian-ssl is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:48:41] PROBLEM - Confd template for /srv/config-master/pybal/codfw/swift on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/swift is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:48:41] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/appservers-https on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/appservers-https is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:48:41] PROBLEM - Confd template for /srv/config-master/pybal/ulsfo/ncredir on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/ulsfo/ncredir is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:48:41] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/text on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/text is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:48:42] PROBLEM - Confd template for /srv/config-master/pybal/codfw/ldap-ro on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/ldap-ro is broken 
https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:48:42] PROBLEM - Confd template for /srv/config-master/pybal/codfw/search-psi-https on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/search-psi-https is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:48:43] sudo confctl select codfw/ldap-ro/ldap-ro/ldap-replica2005.wikimedia.org set/pooled=yes [07:48:43] PROBLEM - Confd template for /srv/config-master/pybal/codfw/upload-https on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/upload-https is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:48:43] PROBLEM - Maps HTTPS on maps2005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [07:48:44] PROBLEM - proton LVS codfw on proton.svc.codfw.wmnet is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) is CRITICAL: Test Print the Foo page from en.wp.org in letter format returned the unexpected status 500 (expecting: 200): /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) is CRITICAL: Test Print [07:48:44] page from en.wp.org in A4 format using optimized for reading on mobile devices returned the unexpected status 500 (expecting: 200): /{domain}/v1/pdf/{title}/{format}/{type} (Respond file not found for a nonexistent title) is CRITICAL: Test Respond file not found for a nonexistent title returned the unexpected status 500 (expecting: 404) https://wikitech.wikimedia.org/wiki/Proton [07:48:44] ? 
[07:48:45] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/search on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/search is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:48:47] PROBLEM - Confd template for /srv/config-master/pybal/esams/ncredir on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/esams/ncredir is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:48:47] PROBLEM - Confd template for /srv/config-master/pybal/codfw/ml-ctrl on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/ml-ctrl is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:48:47] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/text on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/text is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:48:47] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/ldap-ro-ssl on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/ldap-ro-ssl is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:48:49] PROBLEM - Maps HTTPS on maps2007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [07:48:50] Amir1: yes [07:48:51] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/ncredir on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/ncredir is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:48:51] PROBLEM - Confd template for /srv/config-master/pybal/codfw/prometheus on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/prometheus is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:48:51] PROBLEM - Confd template for /srv/config-master/pybal/esams/ncredir-https on puppetmaster2001 is CRITICAL: Compilation of file 
/srv/config-master/pybal/esams/ncredir-https is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:48:51] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/search-https on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/search-https is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:48:51] PROBLEM - Confd template for /srv/config-master/pybal/esams/upload-https on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/esams/upload-https is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:48:54] PROBLEM - LVS upload codfw port 80/tcp - Images and other media- upload.eqiad.wikimedia.org IPv6 #page on upload-lb.codfw.wikimedia.org_ipv6 is CRITICAL: connect to address 2620:0:860:ed1a::2:b and port 80: Connection refused https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [07:48:54] PROBLEM - LVS upload codfw port 80/tcp - Images and other media- upload.eqiad.wikimedia.org IPv4 #page on upload-lb.codfw.wikimedia.org is CRITICAL: connect to address 208.80.153.240 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [07:48:55] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/upload-https on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/upload-https is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:48:55] PROBLEM - LVS text codfw port 80/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -Varnish- IPv4 #page on text-lb.codfw.wikimedia.org is CRITICAL: connect to address 208.80.153.224 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [07:48:57] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/upload on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/upload is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:48:59] PROBLEM - LVS 
schema codfw port 443/tcp - Event Schema HTTP service IPv4 on schema.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.43 and port 443: Connection refused https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [07:48:59] PROBLEM - Confd template for /srv/config-master/pybal/codfw/thanos-swift on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/thanos-swift is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:49:00] PROBLEM - LVS text-https codfw port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv6 #page on text-lb.codfw.wikimedia.org_ipv6 is CRITICAL: connect to address 2620:0:860:ed1a::1 and port 443: Connection refused https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [07:49:01] PROBLEM - Confd template for /srv/config-master/pybal/codfw/wdqs on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/wdqs is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:49:01] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/k8s-ingress-staging on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/k8s-ingress-staging is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:49:09] PROBLEM - Maps HTTPS on maps2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [07:49:09] ValueError: not enough values to unpack (expected 2, got 1) [07:49:11] how can it be that everything is broken? 
[07:49:13] PROBLEM - Confd template for /srv/config-master/pybal/codfw/kartotherian on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/kartotherian is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:49:13] PROBLEM - Confd template for /srv/config-master/pybal/codfw/thumbor on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/thumbor is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:49:13] PROBLEM - Confd template for /srv/config-master/pybal/esams/text-https on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/esams/text-https is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:49:15] PROBLEM - Confd template for /srv/config-master/pybal/codfw/ml-staging-ctrl on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/ml-staging-ctrl is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:49:15] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/thanos-query on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/thanos-query is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:49:15] PROBLEM - Confd template for /srv/config-master/pybal/ulsfo/text on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/ulsfo/text is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:49:15] PROBLEM - Confd template for /srv/config-master/pybal/drmrs/ncredir on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/drmrs/ncredir is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:49:15] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/ores on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/ores is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:49:16] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api 
(Zotero and citoid alive) is CRITICAL: Test Zotero and citoid alive returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid [07:49:18] <_joe_> Amir1: just run the same command with set/pooled=yes [07:49:18] PROBLEM - LVS upload-https codfw port 443/tcp - Images and other media- upload.eqiad.wikimedia.org IPv6 #page on upload-lb.codfw.wikimedia.org_ipv6 is CRITICAL: connect to address 2620:0:860:ed1a::2:b and port 443: Connection refused https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [07:49:20] PROBLEM - LVS prometheus codfw port 80/tcp - Prometheus monitoring IPv4 #page on prometheus.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.25 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [07:49:23] PROBLEM - Confd template for /srv/config-master/pybal/codfw/kartotherian on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/kartotherian is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:49:23] PROBLEM - Confd template for /srv/config-master/pybal/codfw/swift-https on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/swift-https is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:49:25] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/text-https on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/text-https is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:49:25] PROBLEM - Confd template for /srv/config-master/pybal/ulsfo/ncredir-https on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/ulsfo/ncredir-https is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:49:25] PROBLEM - Confd template for /srv/config-master/pybal/codfw/search on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/search is broken 
https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:49:25] PROBLEM - Confd template for /srv/config-master/pybal/codfw/videoscaler on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/videoscaler is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:49:25] PROBLEM - Confd template for /srv/config-master/pybal/esams/text on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/esams/text is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:49:26] PROBLEM - Confd template for /srv/config-master/pybal/codfw/git-ssh on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/git-ssh is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:49:29] ladsgroup@cumin1001:~$ sudo confctl select codfw/ldap-ro/ldap-ro/ldap-replica2005.wikimedia.org set/pooled=yes [07:49:29] Traceback (most recent call last): [07:49:29] File "/usr/bin/confctl", line 33, in [07:49:29] sys.exit(load_entry_point('conftool==2.1.3', 'console_scripts', 'confctl')()) [07:49:29] File "/usr/lib/python3/dist-packages/conftool/cli/tool.py", line 343, in main [07:49:29] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/wdqs-internal on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/wdqs-internal is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:49:29] PROBLEM - Confd template for /srv/config-master/pybal/eqsin/ncredir on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqsin/ncredir is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:49:29] PROBLEM - Confd template for /srv/config-master/pybal/codfw/search-psi-https on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/search-psi-https is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:49:29] PROBLEM - Confd template for /srv/config-master/pybal/esams/upload-https on 
puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/esams/upload-https is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:49:29] cli = ToolCliByLabel(args) [07:49:29] File "/usr/lib/python3/dist-packages/conftool/cli/tool.py", line 179, in __init__ [07:49:30] self.parse_selectors() [07:49:30] File "/usr/lib/python3/dist-packages/conftool/cli/tool.py", line 183, in parse_selectors [07:49:31] PROBLEM - Confd template for /srv/config-master/pybal/codfw/wdqs-heavy-queries on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/wdqs-heavy-queries is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:49:31] k, expr = tag.split('=', 1) [07:49:31] ValueError: not enough values to unpack (expected 2, got 1) [07:49:32] doesn't work [07:49:33] PROBLEM - Confd template for /srv/config-master/pybal/drmrs/text on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/drmrs/text is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:49:33] PROBLEM - Confd template for /srv/config-master/pybal/codfw/text-https on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/text-https is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:49:35] PROBLEM - Confd template for /srv/config-master/pybal/codfw/swift-https on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/swift-https is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:49:35] PROBLEM - Confd template for /srv/config-master/pybal/drmrs/upload-https on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/drmrs/upload-https is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:49:35] PROBLEM - Confd template for /srv/config-master/pybal/codfw/ores on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/ores is broken 
https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:49:35] PROBLEM - Confd template for /srv/config-master/pybal/codfw/inference on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/inference is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:49:35] PROBLEM - Confd template for /srv/config-master/pybal/codfw/k8s-ingress-staging on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/k8s-ingress-staging is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:49:36] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/docker-registry on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/docker-registry is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:49:38] PROBLEM - LVS ncredir-https codfw port 443/tcp - Non canonical redirect service IPv6 #page on ncredir-lb.codfw.wikimedia.org_ipv6 is CRITICAL: connect to address 2620:0:860:ed1a::9 and port 443: Connection refused https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [07:49:39] It's all codfw [07:49:39] PROBLEM - LVS wdqs-internal codfw port 80/tcp - Wikidata Query Service - internal IPv4 #page on wdqs-internal.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.41 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [07:49:39] PROBLEM - Confd template for /srv/config-master/pybal/codfw/ml-staging-ctrl on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/ml-staging-ctrl is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:49:39] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/jobrunner on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/jobrunner is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:49:39] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/swift on puppetmaster2001 is CRITICAL: 
Compilation of file /srv/config-master/pybal/eqiad/swift is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:49:39] PROBLEM - Confd template for /srv/config-master/pybal/codfw/videoscaler on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/videoscaler is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:49:40] PROBLEM - Confd template for /srv/config-master/pybal/eqsin/upload on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqsin/upload is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:49:40] PROBLEM - Confd template for /srv/config-master/pybal/codfw/inference on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/inference is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:49:41] PROBLEM - Confd template for /srv/config-master/pybal/eqsin/ncredir on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqsin/ncredir is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:49:41] PROBLEM - Confd template for /srv/config-master/pybal/codfw/ncredir on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/ncredir is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:49:42] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/prometheus on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/prometheus is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:49:42] PROBLEM - Confd template for /srv/config-master/pybal/codfw/ores on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/ores is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:49:43] PROBLEM - Confd template for /srv/config-master/pybal/codfw/wdqs-heavy-queries on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/wdqs-heavy-queries is broken 
https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:49:43] RECOVERY - SSH on wtp1025.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:49:44] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/inference on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/inference is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:49:44] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/prometheus on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/prometheus is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:49:45] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/thanos-swift on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/thanos-swift is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:49:45] PROBLEM - Confd template for /srv/config-master/pybal/drmrs/ncredir-https on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/drmrs/ncredir-https is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:49:46] PROBLEM - Confd template for /srv/config-master/pybal/esams/ncredir on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/esams/ncredir is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:49:46] PROBLEM - Confd template for /srv/config-master/pybal/esams/text on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/esams/text is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:49:47] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/docker-registry on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/docker-registry is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:49:47] PROBLEM - Confd template for 
/srv/config-master/pybal/eqiad/wdqs-heavy-queries on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/wdqs-heavy-queries is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:49:48] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/ncredir-https on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/ncredir-https is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:49:48] PROBLEM - Docker registry HTTPS interface on registry2003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - string schemaVersion not found on https://registry2003.codfw.wmnet:443/v2/wikimedia-stretch/manifests/latest - 362 bytes in 0.133 second response time https://wikitech.wikimedia.org/wiki/Docker [07:49:49] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/git-ssh on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/git-ssh is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:49:49] PROBLEM - Confd template for /srv/config-master/pybal/codfw/text on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/text is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:49:50] PROBLEM - Confd template for /srv/config-master/pybal/codfw/wdqs-ssl on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/wdqs-ssl is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:49:50] PROBLEM - Confd template for /srv/config-master/pybal/drmrs/ncredir on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/drmrs/ncredir is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:49:51] PROBLEM - Confd template for /srv/config-master/pybal/codfw/k8s-ingress-staging on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/k8s-ingress-staging is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring 
[07:49:51] PROBLEM - Confd template for /srv/config-master/pybal/codfw/appservers-https on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/appservers-https is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:49:52] PROBLEM - Confd template for /srv/config-master/pybal/codfw/kartotherian-ssl on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/kartotherian-ssl is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:49:52] PROBLEM - Confd template for /srv/config-master/pybal/drmrs/ncredir-https on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/drmrs/ncredir-https is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:49:53] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/thanos-query on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/thanos-query is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:49:53] PROBLEM - Confd template for /srv/config-master/pybal/ulsfo/text on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/ulsfo/text is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:49:54] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/schema on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/schema is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:49:54] PROBLEM - Confd template for /srv/config-master/pybal/codfw/text on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/text is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:49:55] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/kartotherian on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/kartotherian is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:49:55] PROBLEM - Confd template for 
/srv/config-master/pybal/codfw/ncredir-https on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/ncredir-https is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:49:56] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/upload on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/upload is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:49:56] PROBLEM - Confd template for /srv/config-master/pybal/eqsin/ncredir-https on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqsin/ncredir-https is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:49:57] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/ml-ctrl on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/ml-ctrl is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:49:59] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/swift-https on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/swift-https is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:49:59] PROBLEM - Confd template for /srv/config-master/pybal/ulsfo/text-https on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/ulsfo/text-https is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:49:59] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/search-psi-https on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/search-psi-https is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:50:01] PROBLEM - Confd template for /srv/config-master/pybal/drmrs/upload on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/drmrs/upload is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:50:01] PROBLEM - Confd template for /srv/config-master/pybal/codfw/swift on puppetmaster1001 is 
CRITICAL: Compilation of file /srv/config-master/pybal/codfw/swift is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:50:01] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/thumbor on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/thumbor is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:50:01] PROBLEM - Confd template for /srv/config-master/pybal/codfw/docker-registry on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/docker-registry is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:50:01] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/jobrunner on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/jobrunner is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:50:02] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/ncredir-https on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/ncredir-https is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:50:03] there's too much noise here, discussion in _security??
[07:50:03] PROBLEM - LVS ncredir codfw port 80/tcp - Non canonical domains redirect service IPv6 #page on ncredir-lb.codfw.wikimedia.org_ipv6 is CRITICAL: connect to address 2620:0:860:ed1a::9 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [07:50:04] PROBLEM - Confd template for /srv/config-master/pybal/codfw/ldap-ro-ssl on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/ldap-ro-ssl is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:50:05] PROBLEM - Confd template for /srv/config-master/pybal/eqsin/upload on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqsin/upload is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:50:05] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/ldap-ro-ssl on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/ldap-ro-ssl is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:50:05] PROBLEM - Confd template for /srv/config-master/pybal/codfw/search on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/search is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:50:05] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/videoscaler on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/videoscaler is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:50:05] PROBLEM - Confd template for /srv/config-master/pybal/ulsfo/upload-https on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/ulsfo/upload-https is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:50:06] PROBLEM - Confd template for /srv/config-master/pybal/drmrs/upload-https on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/drmrs/upload-https is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:50:06] PROBLEM - 
Confd template for /srv/config-master/pybal/codfw/thanos-swift on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/thanos-swift is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:50:09] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/wdqs-ssl on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/wdqs-ssl is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:50:09] PROBLEM - Confd template for /srv/config-master/pybal/codfw/schema on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/schema is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:50:09] PROBLEM - Confd template for /srv/config-master/pybal/esams/ncredir-https on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/esams/ncredir-https is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:50:09] PROBLEM - Confd template for /srv/config-master/pybal/codfw/thanos-query on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/thanos-query is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:50:09] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/search-omega-https on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/search-omega-https is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:50:11] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/thanos-swift on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/thanos-swift is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:50:11] PROBLEM - Confd template for /srv/config-master/pybal/drmrs/text-https on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/drmrs/text-https is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:50:11] !log oblivian@puppetmaster1001 conftool action : 
set/pooled=yes; selector: cluster=api-appserver,dc=codfw [07:50:11] PROBLEM - Confd template for /srv/config-master/pybal/codfw/text-https on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/text-https is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:50:11] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/search-omega-https on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/search-omega-https is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:50:12] PROBLEM - Confd template for /srv/config-master/pybal/ulsfo/text-https on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/ulsfo/text-https is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:50:12] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/ldap-ro on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/ldap-ro is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:50:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:50:15] PROBLEM - Confd template for /srv/config-master/pybal/codfw/ncredir on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/ncredir is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:50:15] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:50:18] PROBLEM - LVS ncredir codfw port 80/tcp - Non canonical domains redirect service IPv4 #page on ncredir-lb.codfw.wikimedia.org is CRITICAL: connect to address 208.80.153.232 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [07:50:18] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/search on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/search is broken 
https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:50:21] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/upload-https on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/upload-https is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:50:22] PROBLEM - LVS ores codfw port 443/tcp - Objective Revision Evaluation Service. ores.svc.codfw.wmnet IPv4 #page on ores.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.10 and port 443: Connection refused https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [07:50:22] PROBLEM - Confd template for /srv/config-master/pybal/codfw/ncredir-https on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/ncredir-https is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:50:22] PROBLEM - Confd template for /srv/config-master/pybal/codfw/ml-ctrl on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/ml-ctrl is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:50:22] PROBLEM - Confd template for /srv/config-master/pybal/eqsin/text-https on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqsin/text-https is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:50:23] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/inference on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/inference is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:50:23] PROBLEM - Confd template for /srv/config-master/pybal/esams/upload on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/esams/upload is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:50:23] PROBLEM - Confd template for /srv/config-master/pybal/codfw/upload on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/upload is broken 
https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:50:24] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/wdqs-ssl on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/wdqs-ssl is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:50:24] PROBLEM - Confd template for /srv/config-master/pybal/codfw/wdqs-internal on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/wdqs-internal is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:50:25] PROBLEM - Confd template for /srv/config-master/pybal/ulsfo/upload on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/ulsfo/upload is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:50:25] PROBLEM - Confd template for /srv/config-master/pybal/codfw/schema on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/schema is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:50:26] PROBLEM - Confd template for /srv/config-master/pybal/codfw/thanos-query on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/thanos-query is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:50:26] PROBLEM - Confd template for /srv/config-master/pybal/drmrs/upload on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/drmrs/upload is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:50:27] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/thumbor on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/thumbor is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:50:27] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/search-psi-https on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/search-psi-https is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:50:28] PROBLEM - 
Confd template for /srv/config-master/pybal/ulsfo/upload on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/ulsfo/upload is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:50:28] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/videoscaler on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/videoscaler is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:50:29] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/kartotherian-ssl on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/kartotherian-ssl is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:50:29] PROBLEM - Confd template for /srv/config-master/pybal/eqsin/text on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqsin/text is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:50:30] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/schema on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/schema is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:50:30] PROBLEM - Docker registry HTTPS interface on registry2004 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - string schemaVersion not found on https://registry2004.codfw.wmnet:443/v2/wikimedia-stretch/manifests/latest - 362 bytes in 0.141 second response time https://wikitech.wikimedia.org/wiki/Docker [07:50:31] PROBLEM - Docker registry health on registry2004 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - pattern not found - 379 bytes in 0.069 second response time https://wikitech.wikimedia.org/wiki/Docker [07:50:32] PROBLEM - LVS thanos-query codfw port 443/tcp - Prometheus long-term storage- query service IPv4 #page on thanos-query.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.53 and port 443: Connection refused 
https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [07:50:32] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/wdqs-heavy-queries on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/wdqs-heavy-queries is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:50:32] PROBLEM - Confd template for /srv/config-master/pybal/codfw/jobrunner on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/jobrunner is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:50:33] PROBLEM - Confd template for /srv/config-master/pybal/codfw/search-omega-https on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/search-omega-https is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:50:33] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/appservers-https on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/appservers-https is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:50:34] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/git-ssh on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/git-ssh is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:50:34] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/ncredir on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/ncredir is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:50:35] PROBLEM - Confd template for /srv/config-master/pybal/codfw/ldap-ro on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/ldap-ro is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:50:35] PROBLEM - Confd template for /srv/config-master/pybal/drmrs/text-https on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/drmrs/text-https is broken 
https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:50:36] PROBLEM - Confd template for /srv/config-master/pybal/codfw/appservers-https on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/appservers-https is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:50:36] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/search-https on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/search-https is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:50:37] PROBLEM - Confd template for /srv/config-master/pybal/codfw/ldap-ro-ssl on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/ldap-ro-ssl is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:50:37] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/ml-ctrl on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/ml-ctrl is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:50:38] PROBLEM - LVS ncredir-https codfw port 443/tcp - Non canonical redirect service IPv4 #page on ncredir-lb.codfw.wikimedia.org is CRITICAL: connect to address 208.80.153.232 and port 443: Connection refused https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [07:50:38] PROBLEM - Confd template for /srv/config-master/pybal/codfw/upload on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/upload is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:50:39] PROBLEM - Confd template for /srv/config-master/pybal/ulsfo/ncredir on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/ulsfo/ncredir is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:50:39] PROBLEM - Confd template for /srv/config-master/pybal/codfw/prometheus on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/prometheus is broken 
https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:50:40] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/swift on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/swift is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:50:40] PROBLEM - Confd template for /srv/config-master/pybal/drmrs/text on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/drmrs/text is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:50:41] PROBLEM - Confd template for /srv/config-master/pybal/eqsin/text-https on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqsin/text-https is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:50:41] PROBLEM - Confd template for /srv/config-master/pybal/eqsin/upload-https on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqsin/upload-https is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:50:42] PROBLEM - Confd template for /srv/config-master/pybal/codfw/docker-registry on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/docker-registry is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:50:42] PROBLEM - Confd template for /srv/config-master/pybal/codfw/thumbor on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/thumbor is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:50:43] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/wdqs on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/wdqs is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:50:43] PROBLEM - Confd template for /srv/config-master/pybal/codfw/search-https on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/search-https is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:50:44] PROBLEM - LVS 
swift-https codfw port 443/tcp - Swift media storage IPv4 #page on ms-fe.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.27 and port 443: Connection refused https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [07:50:44] PROBLEM - Confd template for /srv/config-master/pybal/codfw/upload-https on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/upload-https is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:50:45] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/wdqs on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/wdqs is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:50:45] PROBLEM - LVS jobrunner codfw port 443/tcp - JobRunner LVS interface -https-. jobrunner.svc.codfw.wmnet IPv4 #page on jobrunner.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.26 and port 443: Connection refused https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [07:50:46] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/kartotherian on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/kartotherian is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:50:46] PROBLEM - Confd template for /srv/config-master/pybal/codfw/git-ssh on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/git-ssh is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:50:47] PROBLEM - Confd template for /srv/config-master/pybal/codfw/wdqs-internal on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/wdqs-internal is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:50:47] PROBLEM - Confd template for /srv/config-master/pybal/esams/text-https on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/esams/text-https is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:50:48] PROBLEM - LVS text 
codfw port 80/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -Varnish- IPv6 #page on text-lb.codfw.wikimedia.org_ipv6 is CRITICAL: connect to address 2620:0:860:ed1a::1 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [07:50:48] PROBLEM - Confd template for /srv/config-master/pybal/codfw/search-omega-https on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/search-omega-https is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:50:49] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/ores on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/ores is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:50:49] PROBLEM - Confd template for /srv/config-master/pybal/eqsin/ncredir-https on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqsin/ncredir-https is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:50:50] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/wdqs-internal on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/wdqs-internal is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:50:50] PROBLEM - LVS thanos-swift codfw port 443/tcp - Prometheus long-term storage- object storage -swift- access IPv4 #page on thanos-swift.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.54 and port 443: Connection refused https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [07:50:54] PROBLEM - LVS text-https codfw port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv4 #page on text-lb.codfw.wikimedia.org is CRITICAL: connect to address 208.80.153.224 and port 443: Connection refused https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [07:50:57] PROBLEM - Maps HTTPS on maps1009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Maps/RunBook [07:51:03] !log oblivian@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=codfw [07:51:05] PROBLEM - Confd template for /srv/config-master/pybal/eqsin/upload-https on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqsin/upload-https is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:51:05] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/swift-https on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/swift-https is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:51:05] PROBLEM - Confd template for /srv/config-master/pybal/codfw/wdqs on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/wdqs is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [07:51:05] PROBLEM - Maps HTTPS on maps2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [07:51:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:51:10] RECOVERY - LVS swift codfw port 80/tcp - Swift media storage IPv4 #page on ms-fe.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.066 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [07:51:13] PROBLEM - PyBal IPVS diff check on lvs2007 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([2620:0:860:ed1a::9:443, 208.80.153.232:443, 2620:0:860:ed1a::9:80, 208.80.153.232:80]) https://wikitech.wikimedia.org/wiki/PyBal [07:51:18] PROBLEM - LVS search-omega-https codfw port 9443/tcp - Elasticsearch search for MediaWiki -Omega cluster- - HTTPS IPv4 #page on search.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.30 and port 9443: Connection refused https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [07:51:24] RECOVERY - LVS upload-https codfw port 443/tcp - Images and other media- 
upload.eqiad.wikimedia.org IPv4 #page on upload-lb.codfw.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 1455 bytes in 0.132 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [07:51:25] RECOVERY - Maps HTTPS on maps2005 is OK: HTTP OK: HTTP/1.1 200 OK - 1287 bytes in 5.225 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [07:51:25] RECOVERY - Maps HTTPS on maps2007 is OK: HTTP OK: HTTP/1.1 200 OK - 1287 bytes in 0.203 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [07:51:33] RECOVERY - proton LVS codfw on proton.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton [07:51:38] RECOVERY - LVS upload codfw port 80/tcp - Images and other media- upload.eqiad.wikimedia.org IPv6 #page on upload-lb.codfw.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 492 bytes in 0.064 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [07:51:39] RECOVERY - LVS upload codfw port 80/tcp - Images and other media- upload.eqiad.wikimedia.org IPv4 #page on upload-lb.codfw.wikimedia.org is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 479 bytes in 0.064 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [07:51:40] RECOVERY - LVS text codfw port 80/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -Varnish- IPv4 #page on text-lb.codfw.wikimedia.org is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 610 bytes in 0.064 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [07:51:45] RECOVERY - LVS text-https codfw port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv6 #page on text-lb.codfw.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 18988 bytes in 0.260 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [07:51:45] RECOVERY - Maps HTTPS on maps2010 is OK: HTTP OK: HTTP/1.1 200 OK - 1287 bytes in 0.199 second 
response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [07:51:45] RECOVERY - LVS schema codfw port 443/tcp - Event Schema HTTP service IPv4 on schema.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 482 bytes in 1.184 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [07:52:03] RECOVERY - LVS upload-https codfw port 443/tcp - Images and other media- upload.eqiad.wikimedia.org IPv6 #page on upload-lb.codfw.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 1467 bytes in 0.138 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [07:52:04] RECOVERY - LVS prometheus codfw port 80/tcp - Prometheus monitoring IPv4 #page on prometheus.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 10959 bytes in 0.067 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [07:52:04] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [07:52:15] RECOVERY - LVS ncredir-https codfw port 443/tcp - Non canonical redirect service IPv6 #page on ncredir-lb.codfw.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 233 bytes in 0.132 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [07:52:21] RECOVERY - LVS wdqs-internal codfw port 80/tcp - Wikidata Query Service - internal IPv4 #page on wdqs-internal.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.077 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [07:52:30] RECOVERY - Docker registry HTTPS interface on registry2003 is OK: HTTP OK: HTTP/1.1 200 OK - 3791 bytes in 0.336 second response time https://wikitech.wikimedia.org/wiki/Docker [07:52:42] PROBLEM - Maps - OSM synchronization lag - codfw on alert1001 is CRITICAL: 4.201e+06 ge 2.592e+05 https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/d/000000305/maps-performances?orgId=1&viewPanel=12 [07:52:42] 
RECOVERY - mediawiki-installation DSH group on mw2414 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [07:52:45] RECOVERY - LVS ncredir codfw port 80/tcp - Non canonical domains redirect service IPv6 #page on ncredir-lb.codfw.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 159 bytes in 0.069 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [07:52:59] RECOVERY - LVS ncredir codfw port 80/tcp - Non canonical domains redirect service IPv4 #page on ncredir-lb.codfw.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 159 bytes in 0.065 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [07:53:05] RECOVERY - LVS ores codfw port 443/tcp - Objective Revision Evaluation Service. ores.svc.codfw.wmnet IPv4 #page on ores.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 8589 bytes in 1.188 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [07:53:05] PROBLEM - Check systemd state on webperf2002 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:53:12] RECOVERY - Docker registry HTTPS interface on registry2004 is OK: HTTP OK: HTTP/1.1 200 OK - 3791 bytes in 0.925 second response time https://wikitech.wikimedia.org/wiki/Docker [07:53:12] RECOVERY - Docker registry health on registry2004 is OK: HTTP OK: HTTP/1.1 200 OK - 143 bytes in 0.064 second response time https://wikitech.wikimedia.org/wiki/Docker [07:53:15] RECOVERY - Host thumbor.svc.codfw.wmnet is UP: PING OK - Packet loss = 0%, RTA = 33.04 ms [07:53:15] RECOVERY - LVS thanos-query codfw port 443/tcp - Prometheus long-term storage- query service IPv4 #page on thanos-query.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 187 bytes in 1.140 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [07:53:16] RECOVERY - LVS ncredir-https codfw port 443/tcp - Non 
canonical redirect service IPv4 #page on ncredir-lb.codfw.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 233 bytes in 0.138 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [07:53:23] RECOVERY - LVS swift-https codfw port 443/tcp - Swift media storage IPv4 #page on ms-fe.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 185 bytes in 0.131 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [07:53:27] RECOVERY - LVS jobrunner codfw port 443/tcp - JobRunner LVS interface -https-. jobrunner.svc.codfw.wmnet IPv4 #page on jobrunner.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 398 bytes in 1.211 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [07:53:31] RECOVERY - LVS text codfw port 80/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -Varnish- IPv6 #page on text-lb.codfw.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 623 bytes in 0.063 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [07:53:32] RECOVERY - LVS thanos-swift codfw port 443/tcp - Prometheus long-term storage- object storage -swift- access IPv4 #page on thanos-swift.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 279 bytes in 1.140 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [07:53:32] RECOVERY - Maps HTTPS on maps1009 is OK: HTTP OK: HTTP/1.1 200 OK - 1342 bytes in 0.207 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [07:53:37] RECOVERY - LVS text-https codfw port 443/tcp - Main wiki platform LVS service- text.eqiad.wikimedia.org -nginx- IPv4 #page on text-lb.codfw.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 18976 bytes in 0.317 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [07:53:37] RECOVERY - Maps HTTPS on maps2009 is OK: HTTP OK: HTTP/1.1 200 OK - 1287 bytes in 0.188 second response time 
https://wikitech.wikimedia.org/wiki/Maps/RunBook [07:53:40] ^ those are being handled in the private channel [07:53:46] RECOVERY - Maps HTTPS on maps2008 is OK: HTTP OK: HTTP/1.1 200 OK - 1287 bytes in 0.194 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [07:53:48] RECOVERY - LVS wdqs-ssl codfw port 443/tcp - Wikidata Query Service - HTTPS IPv4 #page on wdqs.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 482 bytes in 1.186 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [07:53:50] RECOVERY - LVS k8s-ingress-staging codfw port 30443/tcp - istio-ingresscontroller on kubernetes staging. k8s-ingress-staging.svc.codfw.wmnet IPv4 on k8s-ingress-staging.svc.codfw.wmnet is OK: TCP OK - 0.032 second response time on 10.2.1.69 port 30443 https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [07:53:59] RECOVERY - LVS search-omega-https codfw port 9443/tcp - Elasticsearch search for MediaWiki -Omega cluster- - HTTPS IPv4 #page on search.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 683 bytes in 0.138 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [07:54:52] RECOVERY - PyBal IPVS diff check on lvs2010 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [07:55:04] RECOVERY - PyBal IPVS diff check on lvs2009 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [07:55:34] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:56:10] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:57:32] RECOVERY - PyBal IPVS diff check on lvs2007 is OK: OK: no difference between hosts in IPVS/PyBal 
https://wikitech.wikimedia.org/wiki/PyBal [07:57:47] (JobUnavailable) firing: (7) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:59:39] (ProbeDown) firing: (31) Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:00:12] RECOVERY - Check systemd state on webperf2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:01:58] (MjolnirUpdateFailureRateExceedesThreshold) resolved: Data shipping to CirrusSearch in codfw is experiencing abnormal failure rates - https://grafana.wikimedia.org/d/000000591/elasticsearch-mjolnir-bulk-updates - https://alerts.wikimedia.org/?q=alertname%3DMjolnirUpdateFailureRateExceedesThreshold [08:02:18] !log ladsgroup@cumin1001 conftool action : set/pooled=no; selector: name=mw1415.eqiad.wmnet [08:02:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:02:39] (CR) Muehlenhoff: "recheck" [cookbooks] - https://gerrit.wikimedia.org/r/790266 (owner: Muehlenhoff) [08:03:39] Running it correctly this time [08:03:42] https://www.irccloud.com/pastebin/vSO5u9JU/ [08:03:52] !log ladsgroup@cumin1001 conftool action : set/pooled=inactive; selector: name=mw1415.eqiad.wmnet [08:03:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:04:50] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:05:21] !log marostegui@cumin1001 dbctl commit (dc=all):
'Increase traffic on db1172 to test 10.6 T307546', diff saved to https://phabricator.wikimedia.org/P27765 and previous config saved to /var/cache/conftool/dbconfig/20220509-080521-marostegui.json [08:05:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:05:26] T307546: Migrate a wikidata DB to MariaDB 10.6 - https://phabricator.wikimedia.org/T307546 [08:07:10] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [08:07:55] (PS1) Hashar: Revert "No longer install subversion on Phabricator hosts" [puppet] - https://gerrit.wikimedia.org/r/789958 (https://phabricator.wikimedia.org/T307889) [08:09:20] !log temp stop tegola-swift-container delete - T307184 [08:09:20] (JobUnavailable) firing: (7) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:09:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:09:25] T307184: Followups for Tegola and Swift interactions - https://phabricator.wikimedia.org/T307184 [08:09:33] (PS3) Muehlenhoff: sre.ganeti.reboot-vm: Add an option to skip waiting for successful Puppet run [cookbooks] - https://gerrit.wikimedia.org/r/790266 [08:09:39] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [08:10:38] (ProbeDown) resolved: (29) Service appservers-https:443 has failed probes
(http_appservers-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:10:44] (ProbeDown) resolved: (29) Service appservers-https:443 has failed probes (http_appservers-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:12:13] (CR) Filippo Giunchedi: [C: +1] "LGTM, I'll let you merge, Brett" [puppet] - https://gerrit.wikimedia.org/r/789881 (owner: BCornwall) [08:12:27] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [08:12:45] (ProbeDown) firing: (30) Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:13:01] (ProbeDown) resolved: (30) Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:15:17] (CR) Filippo Giunchedi: [C: +2] prometheus: remove high NEL alert, moved to alerts.git [puppet] - https://gerrit.wikimedia.org/r/789152 (https://phabricator.wikimedia.org/T305847) (owner: Filippo Giunchedi) [08:15:24]
(PS2) Filippo Giunchedi: prometheus: remove high NEL alert, moved to alerts.git [puppet] - https://gerrit.wikimedia.org/r/789152 (https://phabricator.wikimedia.org/T305847) [08:19:46] RECOVERY - mediawiki-installation DSH group on mw2418 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [08:20:48] RECOVERY - Confd template for /srv/config-master/pybal/codfw/git-ssh on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:20:48] RECOVERY - Confd template for /srv/config-master/pybal/codfw/inference on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:20:48] RECOVERY - Confd template for /srv/config-master/pybal/codfw/k8s-ingress-staging on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:20:48] RECOVERY - Confd template for /srv/config-master/pybal/codfw/jobrunner on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:20:48] RECOVERY - Confd template for /srv/config-master/pybal/codfw/ldap-ro on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:21:09] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM prometheus4001.ulsfo.wmnet [08:21:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:22:22] RECOVERY - Confd template for /srv/config-master/pybal/codfw/search on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:22:24] RECOVERY - Confd template for /srv/config-master/pybal/codfw/search-psi-https on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:22:30] RECOVERY - Confd template for /srv/config-master/pybal/codfw/ores on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:22:44] !log
restarting confd on puppetmaster100[12] [08:22:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:22:50] RECOVERY - Confd template for /srv/config-master/pybal/codfw/docker-registry on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:23:53] SRE, SRE-swift-storage, Commons: Server error 0 after uploading chunk - https://phabricator.wikimedia.org/T307874 (Peachey88) [08:24:20] RECOVERY - mediawiki-installation DSH group on mw2417 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [08:24:54] PROBLEM - puppet last run on an-master1001 is CRITICAL: CRITICAL: Puppet last ran 3 days ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:24:56] RECOVERY - Confd template for /srv/config-master/pybal/esams/ncredir on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:24:56] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/wdqs-heavy-queries on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:24:56] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/docker-registry on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:24:56] RECOVERY - Confd template for /srv/config-master/pybal/esams/text on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:24:58] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/ncredir-https on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:25:00] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/git-ssh on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:25:00] RECOVERY - Confd template for /srv/config-master/pybal/codfw/text on puppetmaster2001 is OK: No errors
detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:25:02] RECOVERY - Confd template for /srv/config-master/pybal/codfw/wdqs-ssl on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:25:02] RECOVERY - Confd template for /srv/config-master/pybal/codfw/k8s-ingress-staging on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:25:02] RECOVERY - Confd template for /srv/config-master/pybal/drmrs/ncredir on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:25:02] RECOVERY - Confd template for /srv/config-master/pybal/codfw/kartotherian-ssl on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:25:02] RECOVERY - Confd template for /srv/config-master/pybal/codfw/appservers-https on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:25:02] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/thanos-query on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:25:02] RECOVERY - Confd template for /srv/config-master/pybal/drmrs/ncredir-https on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:25:03] RECOVERY - Confd template for /srv/config-master/pybal/ulsfo/text on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:25:03] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/schema on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:25:10] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/ml-ctrl on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:25:14] RECOVERY - Confd template for /srv/config-master/pybal/codfw/ldap-ro-ssl on 
puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:25:16] RECOVERY - Confd template for /srv/config-master/pybal/ulsfo/upload-https on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:25:16] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/videoscaler on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:25:16] RECOVERY - Confd template for /srv/config-master/pybal/codfw/thanos-swift on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:25:16] RECOVERY - Confd template for /srv/config-master/pybal/drmrs/upload-https on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:25:16] RECOVERY - Confd template for /srv/config-master/pybal/codfw/search on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:25:20] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/thanos-swift on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:25:20] RECOVERY - Confd template for /srv/config-master/pybal/drmrs/text-https on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:25:22] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/search-omega-https on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:25:22] RECOVERY - Confd template for /srv/config-master/pybal/ulsfo/text-https on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:25:22] RECOVERY - Confd template for /srv/config-master/pybal/codfw/text-https on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:25:22] RECOVERY - Confd template for 
/srv/config-master/pybal/eqiad/ldap-ro on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:25:28] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/search on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:25:30] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/upload-https on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:25:30] RECOVERY - Confd template for /srv/config-master/pybal/codfw/ncredir-https on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:25:30] RECOVERY - Confd template for /srv/config-master/pybal/codfw/ml-ctrl on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:25:30] RECOVERY - Confd template for /srv/config-master/pybal/eqsin/text-https on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:25:30] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/inference on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:25:30] RECOVERY - Confd template for /srv/config-master/pybal/esams/upload on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:25:30] RECOVERY - Confd template for /srv/config-master/pybal/codfw/upload on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:25:31] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/wdqs-ssl on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:25:34] RECOVERY - Confd template for /srv/config-master/pybal/codfw/schema on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:25:34] RECOVERY - Confd template 
for /srv/config-master/pybal/codfw/thanos-query on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:25:34] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/search-psi-https on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:25:34] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/thumbor on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:25:34] RECOVERY - Confd template for /srv/config-master/pybal/drmrs/upload on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:25:34] RECOVERY - Confd template for /srv/config-master/pybal/ulsfo/upload on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:25:40] RECOVERY - Confd template for /srv/config-master/pybal/drmrs/text on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:25:42] RECOVERY - Confd template for /srv/config-master/pybal/codfw/docker-registry on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:25:42] RECOVERY - Confd template for /srv/config-master/pybal/codfw/thumbor on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:25:42] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/wdqs on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:25:44] RECOVERY - Confd template for /srv/config-master/pybal/codfw/search-https on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:25:48] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/kartotherian on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:25:50] RECOVERY - 
Confd template for /srv/config-master/pybal/esams/text-https on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:25:50] RECOVERY - Confd template for /srv/config-master/pybal/codfw/wdqs-internal on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:25:50] RECOVERY - Confd template for /srv/config-master/pybal/codfw/git-ssh on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:25:52] RECOVERY - Confd template for /srv/config-master/pybal/codfw/search-omega-https on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:25:52] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/wdqs-internal on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:25:52] RECOVERY - Confd template for /srv/config-master/pybal/eqsin/ncredir-https on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:25:52] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/ores on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:25:59] SRE, Infrastructure-Foundations, Mail: [Urgent] Google returning 503 error when delivering to mx1001 and mx2001 - https://phabricator.wikimedia.org/T307873 (jbond) Also see {F35119049} [08:26:06] RECOVERY - Confd template for /srv/config-master/pybal/eqsin/upload-https on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:26:06] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/k8s-ingress-staging on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:26:06] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/swift-https on puppetmaster2001 is OK: No errors detected
https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:26:06] RECOVERY - Confd template for /srv/config-master/pybal/codfw/wdqs on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:26:18] RECOVERY - Confd template for /srv/config-master/pybal/eqsin/text on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:26:20] RECOVERY - Confd template for /srv/config-master/pybal/codfw/jobrunner on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:26:24] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/kartotherian-ssl on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:26:24] RECOVERY - Confd template for /srv/config-master/pybal/ulsfo/ncredir on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:26:24] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/appservers-https on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:26:24] RECOVERY - Confd template for /srv/config-master/pybal/codfw/swift on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:26:24] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/text on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:26:24] RECOVERY - Confd template for /srv/config-master/pybal/codfw/upload-https on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:26:24] RECOVERY - Confd template for /srv/config-master/pybal/codfw/ldap-ro on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:26:25] RECOVERY - Confd template for /srv/config-master/pybal/codfw/search-psi-https on puppetmaster2001 is OK: No 
errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:26:30] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/ldap-ro-ssl on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:26:34] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/ncredir on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:26:34] RECOVERY - Confd template for /srv/config-master/pybal/codfw/prometheus on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:26:34] RECOVERY - Confd template for /srv/config-master/pybal/esams/ncredir-https on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:26:34] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/search-https on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:26:34] RECOVERY - Confd template for /srv/config-master/pybal/esams/upload-https on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:26:37] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM prometheus4001.ulsfo.wmnet [08:26:38] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/upload on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:26:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:26:56] (CR) JMeybohm: [C: +1] scaffold: fix issue where volumes will be folded into comment [deployment-charts] - https://gerrit.wikimedia.org/r/789850 (owner: Hnowlan) [08:27:00] RECOVERY - Confd template for /srv/config-master/pybal/codfw/kartotherian on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:27:00] RECOVERY - Confd template for
/srv/config-master/pybal/codfw/swift-https on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:27:02] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/text-https on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:27:02] RECOVERY - Confd template for /srv/config-master/pybal/ulsfo/ncredir-https on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:27:10] SRE, Infrastructure-Foundations, Mail: [Urgent] Google returning 503 error when delivering to mx1001 and mx2001 - https://phabricator.wikimedia.org/T307873 (jbond) @jhathaway wonder if anything may have changed recently [08:27:14] RECOVERY - Confd template for /srv/config-master/pybal/eqsin/upload on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:27:14] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/swift on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:27:14] RECOVERY - Confd template for /srv/config-master/pybal/codfw/videoscaler on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:27:14] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/jobrunner on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:27:14] RECOVERY - Confd template for /srv/config-master/pybal/codfw/inference on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:27:14] RECOVERY - Confd template for /srv/config-master/pybal/codfw/ml-staging-ctrl on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:27:14] RECOVERY - Confd template for /srv/config-master/pybal/eqsin/ncredir on puppetmaster2001 is OK: No errors detected
https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:27:15] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/prometheus on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:27:15] RECOVERY - Confd template for /srv/config-master/pybal/codfw/ncredir on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:27:16] RECOVERY - Confd template for /srv/config-master/pybal/codfw/wdqs-heavy-queries on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:27:16] RECOVERY - Confd template for /srv/config-master/pybal/codfw/ores on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [08:29:25] (CR) Jaime Nuche: [C: -1] scap: add new `scap` user to deployment hosts and scap targets (1 comment) [puppet] - https://gerrit.wikimedia.org/r/789146 (https://phabricator.wikimedia.org/T306991) (owner: Jaime Nuche) [08:29:43] (PS1) Filippo Giunchedi: hieradata: set ncredir as non-paging [puppet] - https://gerrit.wikimedia.org/r/790269 (https://phabricator.wikimedia.org/T291946) [08:30:03] !log restarting blazegraph on wdqs1004 (BlazegraphFreeAllocatorsDecreasingRapidly) [08:30:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:31:12] RECOVERY - puppet last run on an-master1001 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:32:31] (CR) Muehlenhoff: scap: add new `scap` user to deployment hosts and scap targets (1 comment) [puppet] - https://gerrit.wikimedia.org/r/789146 (https://phabricator.wikimedia.org/T306991) (owner: Jaime Nuche) [08:39:13] (KubernetesRsyslogDown) firing: rsyslog on ml-serve-ctrl1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [08:40:04] !log ladsgroup@cumin1001 conftool action : set/pooled=no; selector: name=mw2412.codfw.wmnet [08:40:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:30] !log ladsgroup@cumin1001 conftool action : set/pooled=no; selector: name=ores2002.codfw.wmnet [08:40:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:45] SRE, Infrastructure-Foundations: Migrate Ganeti installations in edge sites to a fixed KVM machine type - https://phabricator.wikimedia.org/T307423 (MoritzMuehlenhoff) [08:40:47] SRE, Infrastructure-Foundations: Migrate Ganeti installations in ulsfo to a fixed KVM machine type - https://phabricator.wikimedia.org/T307425 (MoritzMuehlenhoff) p:Triage→Medium [08:41:00] SRE, Infrastructure-Foundations: Migrate Ganeti installations in ulsfo to a fixed KVM machine type - https://phabricator.wikimedia.org/T307425 (MoritzMuehlenhoff) Open→Resolved a:MoritzMuehlenhoff This is complete [08:41:05] !log ladsgroup@cumin1001 conftool action : set/pooled=no; selector: name=elastic2033.codfw.wmnet [08:41:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:10] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM prometheus5001.eqsin.wmnet [08:41:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:35] !log hashar@deploy1002 Started deploy [gerrit/gerrit@94c5028]: Update Zuul plugin - T307621 [08:41:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:39] T307621: Zuul Depends-On footer processing missing from Gerrit UI after 3.4.4 upgrade - https://phabricator.wikimedia.org/T307621 [08:41:44] !log hashar@deploy1002 Finished deploy [gerrit/gerrit@94c5028]: Update Zuul plugin - T307621 (duration: 00m 09s) [08:41:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[08:42:25] !log Restarting Gerrit on replica gerrit2001.wikimedia.org to update the Zuul plugin # T307621 [08:42:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:43:48] !log hashar@deploy1002 Started deploy [gerrit/gerrit@94c5028]: Update Zuul plugin - T307621 [08:43:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:43:55] !log hashar@deploy1002 Finished deploy [gerrit/gerrit@94c5028]: Update Zuul plugin - T307621 (duration: 00m 07s) [08:43:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:45:04] hi folks, I wanted to gently nudge this story to try to get it some attention: https://phabricator.wikimedia.org/T307351 [08:45:14] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM prometheus5001.eqsin.wmnet [08:45:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:47:01] !log Restarting Gerrit for plugin update [08:47:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:47:05] SRE, ops-esams, DC-Ops, Infrastructure-Foundations, and 2 others: decommission atlas-esams - https://phabricator.wikimedia.org/T307026 (ayounsi) a:ayounsi [08:47:50] RECOVERY - mediawiki-installation DSH group on mw2416 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [08:51:46] !log Gerrit is back and operational [08:51:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:53:06] !log mw241[2-9]: scap pull [08:53:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:53:13] (CR) Jaime Nuche: [C: -1] scap: add new `scap` user to deployment hosts and scap targets (1 comment) [puppet] - https://gerrit.wikimedia.org/r/789146 (https://phabricator.wikimedia.org/T306991) (owner: Jaime Nuche) [08:58:08] SRE, Infrastructure-Foundations: Migrate Ganeti installations in eqsin to a fixed KVM machine type - 
https://phabricator.wikimedia.org/T307426 (MoritzMuehlenhoff) [09:02:24] (Abandoned) Matthias Mullie: Remove fulltext normalisation in synonyms profile for performance [mediawiki-config] - https://gerrit.wikimedia.org/r/753722 (https://phabricator.wikimedia.org/T293106) (owner: Cparle) [09:05:31] SRE, Infrastructure-Foundations, Mail: [Urgent] Google returning 503 error when delivering to mx1001 and mx2001 - https://phabricator.wikimedia.org/T307873 (jbond) I have looked in our logs and the following is an example of what we see on our side ` 2022-05-06 01:57:31 H=mail-lf1-x12b.google.com [2... [09:06:33] (CR) Giuseppe Lavagetto: [C: -1] "Good starting point, but it definitely needs improvements:" [cookbooks] - https://gerrit.wikimedia.org/r/789680 (https://phabricator.wikimedia.org/T260661) (owner: JMeybohm) [09:08:13] SRE, Infrastructure-Foundations: Migrate Ganeti installations in eqsin to a fixed KVM machine type - https://phabricator.wikimedia.org/T307426 (MoritzMuehlenhoff) Open→Resolved a:MoritzMuehlenhoff This is complete [09:08:15] SRE, Infrastructure-Foundations: Migrate Ganeti installations in edge sites to a fixed KVM machine type - https://phabricator.wikimedia.org/T307423 (MoritzMuehlenhoff) [09:09:15] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM prometheus3001.esams.wmnet [09:09:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:13] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM prometheus3001.esams.wmnet [09:13:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:15] (CR) Vgutierrez: [C: +1] hieradata: set ncredir as non-paging [puppet] - https://gerrit.wikimedia.org/r/790269 (https://phabricator.wikimedia.org/T291946) (owner: Filippo Giunchedi) [09:16:19] (PS1) Elukey: kubernetes: allow deploy-ml-service users to check pods on ml-serve [puppet] - 
https://gerrit.wikimedia.org/r/790288 [09:17:36] SRE, Infrastructure-Foundations: Migrate Ganeti installations in esams to a fixed KVM machine type - https://phabricator.wikimedia.org/T307424 (MoritzMuehlenhoff) [09:17:49] (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35132/console" [puppet] - https://gerrit.wikimedia.org/r/790288 (owner: Elukey) [09:17:59] (CR) Elukey: kubernetes: allow deploy-ml-service users to check pods on ml-serve [puppet] - https://gerrit.wikimedia.org/r/790288 (owner: Elukey) [09:21:14] (Abandoned) Slyngshede: Replace crontab with systemd timers for Postgresql dump [puppet] - https://gerrit.wikimedia.org/r/789677 (https://phabricator.wikimedia.org/T273673) (owner: Slyngshede) [09:21:54] (PS1) Btullis: Increase the connect_timeout for eventgate based services [deployment-charts] - https://gerrit.wikimedia.org/r/790289 [09:22:31] SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (MatthewVernon) [09:23:47] (CR) Hnowlan: [C: +2] scaffold: fix issue where volumes will be folded into comment [deployment-charts] - https://gerrit.wikimedia.org/r/789850 (owner: Hnowlan) [09:24:30] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM ping3002.esams.wmnet [09:24:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:22] !log mvernon@cumin1001 START - Cookbook sre.hosts.reimage for host ms-be2052.codfw.wmnet with OS bullseye [09:25:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:29] SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1001 for host ms-be2052.codfw.wmnet with OS bullseye [09:26:09] (CR) 
Filippo Giunchedi: [C: +2] hieradata: set ncredir as non-paging [puppet] - https://gerrit.wikimedia.org/r/790269 (https://phabricator.wikimedia.org/T291946) (owner: Filippo Giunchedi) [09:28:14] (PS1) Filippo Giunchedi: prometheus: add 'timeout' override for service::catalog probes [puppet] - https://gerrit.wikimedia.org/r/790291 (https://phabricator.wikimedia.org/T291946) [09:28:16] (PS1) Filippo Giunchedi: hieradata: set thumbor probe timeout [puppet] - https://gerrit.wikimedia.org/r/790292 (https://phabricator.wikimedia.org/T291946) [09:29:13] (Merged) jenkins-bot: scaffold: fix issue where volumes will be folded into comment [deployment-charts] - https://gerrit.wikimedia.org/r/789850 (owner: Hnowlan) [09:29:49] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ping3002.esams.wmnet [09:29:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Increase traffic on db1172 to test 10.6 T307546', diff saved to https://phabricator.wikimedia.org/P27768 and previous config saved to /var/cache/conftool/dbconfig/20220509-093032-marostegui.json [09:30:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:37] T307546: Migrate a wikidata DB to MariaDB 10.6 - https://phabricator.wikimedia.org/T307546 [09:32:59] (PS1) AikoChou: ml-services: update editquality and draftquality image [deployment-charts] - https://gerrit.wikimedia.org/r/790293 (https://phabricator.wikimedia.org/T301766) [09:35:06] SRE, Data-Engineering, Data-Engineering-Kanban, Traffic, Patch-For-Review: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (BTullis) I suggest that we increase the `connect_timeout` value for the `local_service` clust... 
[09:35:38] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM ncredir3001.esams.wmnet [09:35:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:59] !log elukey@deploy1002 Started deploy [ores/deploy@98a1b2e]: (no justification provided) [09:37:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:07] !log elukey@deploy1002 Finished deploy [ores/deploy@98a1b2e]: (no justification provided) (duration: 00m 08s) [09:37:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:21] (PS2) Filippo Giunchedi: prometheus: add 'timeout' override for service::catalog probes [puppet] - https://gerrit.wikimedia.org/r/790291 (https://phabricator.wikimedia.org/T291946) [09:37:23] (PS2) Filippo Giunchedi: hieradata: set thumbor probe timeout [puppet] - https://gerrit.wikimedia.org/r/790292 (https://phabricator.wikimedia.org/T291946) [09:38:01] !log elukey@deploy1002 Started deploy [ores/deploy@98a1b2e]: (no justification provided) [09:38:02] (PS9) Jbond: rake_modules: add check for spdk licence header [puppet] - https://gerrit.wikimedia.org/r/786310 (https://phabricator.wikimedia.org/T67270) [09:38:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:18] (CR) Klausman: [C: +1] "I think a separate group is the right thing for now. 
If we later find that having wider access is desirable, we can always fold it back in" [puppet] - https://gerrit.wikimedia.org/r/790288 (owner: Elukey) [09:38:33] !log elukey@deploy1002 Finished deploy [ores/deploy@98a1b2e]: (no justification provided) (duration: 00m 32s) [09:38:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:45] (CR) jerkins-bot: [V: -1] rake_modules: add check for spdk licence header [puppet] - https://gerrit.wikimedia.org/r/786310 (https://phabricator.wikimedia.org/T67270) (owner: Jbond) [09:40:14] PROBLEM - SSH on wtp1048.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:41:07] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ncredir3001.esams.wmnet [09:41:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:41:57] (PS10) Jbond: rake_modules: add check for spdk licence header [puppet] - https://gerrit.wikimedia.org/r/786310 (https://phabricator.wikimedia.org/T67270) [09:41:59] (PS1) Jbond: rake_modules: rafactor git helper and add new_files [puppet] - https://gerrit.wikimedia.org/r/790294 [09:42:25] !log elukey@deploy1002 Started deploy [ores/deploy@98a1b2e]: (no justification provided) [09:42:26] RECOVERY - Check systemd state on ores2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:42:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:42:30] !log elukey@deploy1002 Finished deploy [ores/deploy@98a1b2e]: (no justification provided) (duration: 00m 05s) [09:42:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:42:42] (CR) jerkins-bot: [V: -1] rake_modules: rafactor git helper and add new_files [puppet] - https://gerrit.wikimedia.org/r/790294 (owner: Jbond) [09:42:46] (CR) 
jerkins-bot: [V: -1] rake_modules: add check for spdk licence header [puppet] - https://gerrit.wikimedia.org/r/786310 (https://phabricator.wikimedia.org/T67270) (owner: Jbond) [09:43:07] (Abandoned) Ayounsi: Update requirements and artifacts for bullseye [software/netbox-deploy] (2-10-4-bullseye) - https://gerrit.wikimedia.org/r/789596 (https://phabricator.wikimedia.org/T296452) (owner: Ayounsi) [09:43:23] (PS11) Jbond: rake_modules: add check for spdk licence header [puppet] - https://gerrit.wikimedia.org/r/786310 (https://phabricator.wikimedia.org/T67270) [09:43:26] RECOVERY - ores_workers_running on ores2001 is OK: PROCS OK: 91 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES [09:44:09] (CR) jerkins-bot: [V: -1] rake_modules: add check for spdk licence header [puppet] - https://gerrit.wikimedia.org/r/786310 (https://phabricator.wikimedia.org/T67270) (owner: Jbond) [09:44:34] RECOVERY - ores on ores2001 is OK: HTTP OK: HTTP/1.0 200 OK - 6397 bytes in 0.082 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/ores [09:44:51] (PS2) Jbond: rake_modules: rafactor git helper and add new_files [puppet] - https://gerrit.wikimedia.org/r/790294 [09:45:23] (CR) jerkins-bot: [V: -1] rake_modules: rafactor git helper and add new_files [puppet] - https://gerrit.wikimedia.org/r/790294 (owner: Jbond) [09:47:18] (CR) Gehel: [C: +1] [Beta Cluster] LabsServices: Switch elastic hosts to bullseye hosts [mediawiki-config] - https://gerrit.wikimedia.org/r/789877 (https://phabricator.wikimedia.org/T299797) (owner: Bking) [09:47:54] (CR) Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/pcc-worker1001/35134/" [puppet] - https://gerrit.wikimedia.org/r/790292 (https://phabricator.wikimedia.org/T291946) (owner: Filippo Giunchedi) [09:48:24] (CR) Filippo Giunchedi: "PCC (while setting 'timeout' for thumbor in the next review, to demo this feature) 
https://puppet-compiler.wmflabs.org/pcc-worker1001/3513" [puppet] - https://gerrit.wikimedia.org/r/790291 (https://phabricator.wikimedia.org/T291946) (owner: Filippo Giunchedi) [09:49:22] (CR) Jbond: "thanks see inline" [puppet] - https://gerrit.wikimedia.org/r/786310 (https://phabricator.wikimedia.org/T67270) (owner: Jbond) [09:49:27] (PS12) Jbond: rake_modules: add check for spdk licence header [puppet] - https://gerrit.wikimedia.org/r/786310 (https://phabricator.wikimedia.org/T67270) [09:50:08] (CR) JMeybohm: Add a cookbook for rolling reboot of k8s clusters (4 comments) [cookbooks] - https://gerrit.wikimedia.org/r/789680 (https://phabricator.wikimedia.org/T260661) (owner: JMeybohm) [09:50:40] PROBLEM - SSH on wtp1045.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:50:48] (CR) jerkins-bot: [V: -1] rake_modules: add check for spdk licence header [puppet] - https://gerrit.wikimedia.org/r/786310 (https://phabricator.wikimedia.org/T67270) (owner: Jbond) [09:51:25] (Abandoned) Majavah: policies/cr-labs: Allow tftp to install servers [homer/public] - https://gerrit.wikimedia.org/r/769508 (https://phabricator.wikimedia.org/T303296) (owner: Majavah) [09:52:26] !log mvernon@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2052.codfw.wmnet with reason: host reimage [09:52:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:04] (PS3) Jbond: rake_modules: rafactor git helper and add new_files [puppet] - https://gerrit.wikimedia.org/r/790294 [09:55:18] (PS13) Jbond: rake_modules: add check for spdk licence header [puppet] - https://gerrit.wikimedia.org/r/786310 (https://phabricator.wikimedia.org/T67270) [09:55:28] (PS6) Jbond: rake: Add new rake task to convert a module to SPDX [puppet] - https://gerrit.wikimedia.org/r/789790 [09:55:51] !log mvernon@cumin1001 END 
(PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2052.codfw.wmnet with reason: host reimage [09:55:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:21] (PS7) Jbond: rake: Add new rake task to convert a module to SPDX [puppet] - https://gerrit.wikimedia.org/r/789790 [09:57:23] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:57:45] (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [10:00:45] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:01:31] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [10:01:43] (CR) JMeybohm: [C: +1] "Oh yeah - sorry. 
We recently changed the default group from wikidev to deployment in the course of https://phabricator.wikimedia.org/T3057" [puppet] - https://gerrit.wikimedia.org/r/790288 (owner: Elukey) [10:02:41] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [10:02:57] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM ncredir3002.esams.wmnet [10:03:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:03:09] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:03:46] jouncebot: nowandnext [10:03:46] No deployments scheduled for the next 2 hour(s) and 56 minute(s) [10:03:46] In 2 hour(s) and 56 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220509T1300) [10:04:25] (PS7) MarcoAurelio: Enable $wgFixDoubleRedirects on officewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/780636 (https://phabricator.wikimedia.org/T305782) [10:06:21] RECOVERY - ores_workers_running on ores2002 is OK: PROCS OK: 91 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES [10:07:13] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ncredir3002.esams.wmnet [10:07:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:18] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:13:18] (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:13:54] (PS1) Elukey: ores: refactor git setup and add settings for Buster [puppet] - https://gerrit.wikimedia.org/r/790297 (https://phabricator.wikimedia.org/T303801) [10:14:27] (CR) jerkins-bot: [V: -1] ores: refactor git setup and add settings for Buster [puppet] - https://gerrit.wikimedia.org/r/790297 (https://phabricator.wikimedia.org/T303801) (owner: Elukey) [10:15:27] (PS2) Btullis: Increase the connect_timeout for eventgate based services [deployment-charts] - https://gerrit.wikimedia.org/r/790289 [10:15:41] SRE, Infrastructure-Foundations: Migrate Ganeti installations in esams to a fixed KVM machine type - https://phabricator.wikimedia.org/T307424 (MoritzMuehlenhoff) [10:15:46] SRE, Infrastructure-Foundations: Migrate Ganeti installations in esams to a fixed KVM machine type - https://phabricator.wikimedia.org/T307424 (MoritzMuehlenhoff) p:Triage→Medium [10:17:40] (PS1) Vgutierrez: mtail::cache_haproxy: Add HAProxy SLI counters [puppet] - https://gerrit.wikimedia.org/r/790298 (https://phabricator.wikimedia.org/T307898) [10:19:46] (CR) jerkins-bot: [V: -1] mtail::cache_haproxy: Add HAProxy SLI counters [puppet] - https://gerrit.wikimedia.org/r/790298 (https://phabricator.wikimedia.org/T307898) (owner: Vgutierrez) [10:20:32] (PS12) Sergio Gimeno: Newcomer tasks: deploy AND topic selection to pilot wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/780874 (https://phabricator.wikimedia.org/T305399) [10:22:03] SRE, Infrastructure-Foundations: Migrate Ganeti installations in drmrs to a fixed KVM machine type - https://phabricator.wikimedia.org/T307427 (MoritzMuehlenhoff) p:Triage→Medium [10:22:33] RECOVERY - Check systemd state on ores2002 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:24:27] (PS2) Vgutierrez: mtail::cache_haproxy: Add HAProxy SLI counters [puppet] - https://gerrit.wikimedia.org/r/790298 (https://phabricator.wikimedia.org/T307898) [10:26:11] (CR) jerkins-bot: [V: -1] mtail::cache_haproxy: Add HAProxy SLI counters [puppet] - https://gerrit.wikimedia.org/r/790298 (https://phabricator.wikimedia.org/T307898) (owner: Vgutierrez) [10:26:41] (CR) Vgutierrez: [C: +1] "Thanks ❤️" [puppet] - https://gerrit.wikimedia.org/r/789188 (owner: Jbond) [10:27:18] (PS1) Jbond: C:postgresql: rename wal_keep_segments to wal_keep_size [puppet] - https://gerrit.wikimedia.org/r/790301 (https://phabricator.wikimedia.org/T296452) [10:27:52] (CR) jerkins-bot: [V: -1] C:postgresql: rename wal_keep_segments to wal_keep_size [puppet] - https://gerrit.wikimedia.org/r/790301 (https://phabricator.wikimedia.org/T296452) (owner: Jbond) [10:28:25] (PS2) Jbond: C:postgresql: rename wal_keep_segments to wal_keep_size [puppet] - https://gerrit.wikimedia.org/r/790301 (https://phabricator.wikimedia.org/T296452) [10:28:58] (CR) jerkins-bot: [V: -1] C:postgresql: rename wal_keep_segments to wal_keep_size [puppet] - https://gerrit.wikimedia.org/r/790301 (https://phabricator.wikimedia.org/T296452) (owner: Jbond) [10:29:12] (PS2) Elukey: ores: refactor git setup and add settings for Buster [puppet] - https://gerrit.wikimedia.org/r/790297 (https://phabricator.wikimedia.org/T303801) [10:30:17] (PS3) Jbond: C:postgresql: rename wal_keep_segments to wal_keep_size [puppet] - https://gerrit.wikimedia.org/r/790301 (https://phabricator.wikimedia.org/T296452) [10:30:30] (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35137/console" [puppet] - https://gerrit.wikimedia.org/r/790297 (https://phabricator.wikimedia.org/T303801) (owner: Elukey) [10:30:35] !log 
mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2052.codfw.wmnet with OS bullseye [10:30:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:30:41] SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1001 for host ms-be2052.codfw.wmnet with OS bullseye completed: - ms-be2052 (**PASS**) - Downtim... [10:30:47] SRE, Infrastructure-Foundations: Migrate Ganeti installations in drmrs to a fixed KVM machine type - https://phabricator.wikimedia.org/T307427 (MoritzMuehlenhoff) [10:31:08] (PS3) Elukey: ores: refactor git setup and add settings for Buster [puppet] - https://gerrit.wikimedia.org/r/790297 (https://phabricator.wikimedia.org/T303801) [10:31:09] SRE, Infrastructure-Foundations: Migrate Ganeti installations in drmrs to a fixed KVM machine type - https://phabricator.wikimedia.org/T307427 (MoritzMuehlenhoff) [10:31:36] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 7): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35138/console" [puppet] - https://gerrit.wikimedia.org/r/790301 (https://phabricator.wikimedia.org/T296452) (owner: Jbond) [10:32:16] (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35139/console" [puppet] - https://gerrit.wikimedia.org/r/790297 (https://phabricator.wikimedia.org/T303801) (owner: Elukey) [10:34:42] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM install6001.wikimedia.org [10:34:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:46] (CR) Klausman: [C: +1] ores: refactor git setup and add settings for Buster [puppet] - https://gerrit.wikimedia.org/r/790297 (https://phabricator.wikimedia.org/T303801) (owner: Elukey) 
[10:36:07] (CR) Elukey: [V: +1 C: +2] ores: refactor git setup and add settings for Buster [puppet] - https://gerrit.wikimedia.org/r/790297 (https://phabricator.wikimedia.org/T303801) (owner: Elukey) [10:37:15] PROBLEM - PHP7 rendering on mw1437 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:37:56] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [10:38:03] (CR) Jbond: [C: +2] C:raid: update hpsa logic to install ssacli tools on > buster [puppet] - https://gerrit.wikimedia.org/r/789240 (https://phabricator.wikimedia.org/T306354) (owner: Jbond) [10:38:09] PROBLEM - PHP7 jobrunner on mw1437 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [10:39:00] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM install6001.wikimedia.org [10:39:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:40] RECOVERY - PHP7 jobrunner on mw1437 is OK: HTTP OK: HTTP/1.1 200 OK - 323 bytes in 0.009 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [10:39:50] (CR) Jbond: [V: +1 C: +2] P:acme_chief: Improve type checking for certificates [puppet] - https://gerrit.wikimedia.org/r/789188 (owner: Jbond) [10:40:23] RECOVERY - PHP7 rendering on mw1437 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.014 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:40:53] RECOVERY - SSH on wtp1048.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:42:43] !log jmm@cumin2002 START - 
Cookbook sre.ganeti.reboot-vm for VM netflow6001.drmrs.wmnet [10:42:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:16] (CirrusSearchJVMGCOldPoolFlatlined) firing: Elasticsearch instance elastic1049-production-search-psi-eqiad is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCOldPoolFlatlined [10:45:18] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:46:09] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM netflow6001.drmrs.wmnet [10:46:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:40] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM bast6001.wikimedia.org [10:48:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:03] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35141/console" [puppet] - https://gerrit.wikimedia.org/r/790301 (https://phabricator.wikimedia.org/T296452) (owner: Jbond) [10:49:24] (CR) Jbond: [V: +1 C: +2] C:postgresql: rename wal_keep_segments to wal_keep_size [puppet] - https://gerrit.wikimedia.org/r/790301 (https://phabricator.wikimedia.org/T296452) (owner: Jbond) [10:50:18] (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:51:37] (PS3) Vgutierrez: mtail::cache_haproxy: Add HAProxy SLI counters [puppet] - https://gerrit.wikimedia.org/r/790298 (https://phabricator.wikimedia.org/T307898) [10:52:38] (CR) Muehlenhoff: scap: add new `scap` user to deployment hosts and scap targets (1 comment) [puppet] - https://gerrit.wikimedia.org/r/789146 (https://phabricator.wikimedia.org/T306991) (owner: Jaime Nuche) [10:52:40] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM bast6001.wikimedia.org [10:52:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:53:27] (CR) Ayounsi: [C: +1] C:postgresql: rename wal_keep_segments to wal_keep_size (2 comments) [puppet] - https://gerrit.wikimedia.org/r/790301 (https://phabricator.wikimedia.org/T296452) (owner: Jbond) [10:54:29] SRE, Infrastructure-Foundations: Migrate Ganeti installations in drmrs to a fixed KVM machine type - https://phabricator.wikimedia.org/T307427 (MoritzMuehlenhoff) [10:55:18] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM prometheus6001.drmrs.wmnet [10:55:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:57:31] (PS1) Jbond: postgresql: fix typo in comments [puppet] - https://gerrit.wikimedia.org/r/790305 [10:58:00] (CR) Jbond: [V: +2 C: +2] postgresql: fix typo in comments [puppet] - https://gerrit.wikimedia.org/r/790305 (owner: Jbond) [10:59:26] (CR) Jbond: [C: +2] P:netbox: Add libapache2-mod-wsgi-py3 [puppet] - https://gerrit.wikimedia.org/r/789821 (owner: Jbond) [10:59:51] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM prometheus6001.drmrs.wmnet [10:59:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:56] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 -
https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [11:05:00] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM ncredir6001.drmrs.wmnet [11:05:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:27] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-fe1010.eqiad.wmnet [11:07:29] !log mvernon@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host ms-fe1010.eqiad.wmnet [11:07:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:57] <_joe_> !log removing stale files from config-master on puppetmaster1001; this could cause some flapping confd alerts [11:11:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:03] (PS10) Jbond: prometheus::blackbox::check: add new blackbox exporter check [puppet] - https://gerrit.wikimedia.org/r/787067 [11:11:07] (CR) Jbond: "updated thanks" [puppet] - https://gerrit.wikimedia.org/r/787067 (owner: Jbond) [11:11:07] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-fe1010.eqiad.wmnet [11:11:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:38] <_joe_> !log removing stale files from config-master on puppetmaster2001; this could cause some flapping confd alerts [11:12:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:14] (PS1) KartikMistry: CX3 Build 0.2.0+20220509 [extensions/ContentTranslation] (wmf/1.39.0-wmf.10) - https://gerrit.wikimedia.org/r/790328 (https://phabricator.wikimedia.org/T306643) [11:13:29] RECOVERY - SSH on wtp1045.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0)
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:16:40] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-fe1010.eqiad.wmnet [11:16:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:21] PROBLEM - PHP7 rendering on mw1437 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:21:31] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ncredir6001.drmrs.wmnet [11:21:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:54] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM ncredir6002.drmrs.wmnet [11:22:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:25] PROBLEM - PHP7 jobrunner on mw1437 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [11:23:25] (PS1) Slyngshede: Rewrite logster::job to use systemd timers. This patch does change the meaning of the variable "weekday", going from being a number between 0 and 6, to a string. Weekday is currently never used (only minutes are), the change will only affect patches and roles that have yet to be written. [puppet] - https://gerrit.wikimedia.org/r/790325 (https://phabricator.wikimedia.org/T123456) [11:24:01] (CR) jerkins-bot: [V: -1] Rewrite logster::job to use systemd timers. This patch does change the meaning of the variable "weekday", going from being a number between 0 and 6, to a string. Weekday is currently never used (only minutes are), the change will only affect patches and roles that have yet to be written.
[puppet] - https://gerrit.wikimedia.org/r/790325 (https://phabricator.wikimedia.org/T123456) (owner: [11:24:23] RECOVERY - PHP7 jobrunner on mw1437 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 5.493 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [11:28:12] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ncredir6002.drmrs.wmnet [11:28:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:38] (PS1) WMDE-Fisch: Refresh MediaWiki globals when loading mapdata [extensions/Kartographer] (wmf/1.39.0-wmf.10) - https://gerrit.wikimedia.org/r/790329 (https://phabricator.wikimedia.org/T307650) [11:29:50] SRE, Infrastructure-Foundations: Migrate Ganeti installations in drmrs to a fixed KVM machine type - https://phabricator.wikimedia.org/T307427 (MoritzMuehlenhoff) [11:30:20] (CR) jerkins-bot: [V: -1] CX3 Build 0.2.0+20220509 [extensions/ContentTranslation] (wmf/1.39.0-wmf.10) - https://gerrit.wikimedia.org/r/790328 (https://phabricator.wikimedia.org/T306643) (owner: KartikMistry) [11:31:09] PROBLEM - PHP7 jobrunner on mw1437 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [11:32:08] (PS2) Slyngshede: Rewrite logster::job to use systemd timers. [puppet] - https://gerrit.wikimedia.org/r/790325 (https://phabricator.wikimedia.org/T123456) [11:32:44] (CR) jerkins-bot: [V: -1] Rewrite logster::job to use systemd timers.
[puppet] - https://gerrit.wikimedia.org/r/790325 (https://phabricator.wikimedia.org/T123456) (owner: Slyngshede) [11:34:09] (PS1) Hashar: zuul: disable core.logAllRefUpdates at clone time [puppet] - https://gerrit.wikimedia.org/r/790350 (https://phabricator.wikimedia.org/T307620) [11:34:38] (CR) Hashar: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/790350 (https://phabricator.wikimedia.org/T307620) (owner: Hashar) [11:34:42] (CR) jerkins-bot: [V: -1] zuul: disable core.logAllRefUpdates at clone time [puppet] - https://gerrit.wikimedia.org/r/790350 (https://phabricator.wikimedia.org/T307620) (owner: Hashar) [11:35:10] (CR) KartikMistry: "recheck" [extensions/ContentTranslation] (wmf/1.39.0-wmf.10) - https://gerrit.wikimedia.org/r/790328 (https://phabricator.wikimedia.org/T306643) (owner: KartikMistry) [11:35:43] PROBLEM - Check systemd state on ms-fe1009 is CRITICAL: CRITICAL - degraded: The following units failed: swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:36:20] (PS2) Hashar: zuul: disable core.logAllRefUpdates at clone time [puppet] - https://gerrit.wikimedia.org/r/790350 (https://phabricator.wikimedia.org/T307620) [11:36:36] (CR) Hashar: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/790350 (https://phabricator.wikimedia.org/T307620) (owner: Hashar) [11:38:07] RECOVERY - PHP7 jobrunner on mw1437 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.026 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [11:40:35] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [11:42:19] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [11:43:07] (CR) Hashar: "Well it compiles https://puppet-compiler.wmflabs.org/pcc-worker1001/1321/contint2001.wikimedia.org/index.html :D" [puppet] - https://gerrit.wikimedia.org/r/790350 (https://phabricator.wikimedia.org/T307620) (owner: Hashar) [11:44:02] (CR) jerkins-bot: [V: -1] Refresh MediaWiki globals when loading mapdata [extensions/Kartographer] (wmf/1.39.0-wmf.10) - https://gerrit.wikimedia.org/r/790329 (https://phabricator.wikimedia.org/T307650) (owner: WMDE-Fisch) [11:45:38] !log klausman@cumin2002 START - Cookbook sre.hosts.reimage for host ores2002.codfw.wmnet with OS buster [11:45:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:02] (CR) WMDE-Fisch: "recheck" [extensions/Kartographer] (wmf/1.39.0-wmf.10) - https://gerrit.wikimedia.org/r/790329 (https://phabricator.wikimedia.org/T307650) (owner: WMDE-Fisch) [11:46:33] RECOVERY - PHP7 rendering on mw1437 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 9.608 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:46:55] (NodeTextfileStale) firing: Stale textfile for cloudvirt1019:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [11:51:31] (PS3) Slyngshede: Rewrite logster::job to use systemd timers. [puppet] - https://gerrit.wikimedia.org/r/790325 (https://phabricator.wikimedia.org/T123456) [11:52:06] (CR) jerkins-bot: [V: -1] Rewrite logster::job to use systemd timers. [puppet] - https://gerrit.wikimedia.org/r/790325 (https://phabricator.wikimedia.org/T123456) (owner: Slyngshede) [11:52:44] (PS4) Slyngshede: Rewrite logster::job to use systemd timers.
[puppet] - https://gerrit.wikimedia.org/r/790325 (https://phabricator.wikimedia.org/T123456) [11:53:29] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-fe1011.eqiad.wmnet [11:53:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:10] (PS14) Jbond: P:wmcs::nfs::maintain_dbusers: replace query_nodes with puppetdb_query [puppet] - https://gerrit.wikimedia.org/r/787485 [11:55:21] PROBLEM - PHP7 jobrunner on mw1437 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [11:55:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [11:56:00] (JobUnavailable) firing: (4) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:56:09] (CR) Jbond: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35143/console" [puppet] - https://gerrit.wikimedia.org/r/787485 (owner: Jbond) [11:57:26] RECOVERY - PHP7 jobrunner on mw1437 is OK: HTTP OK: HTTP/1.1 200 OK - 325 bytes in 0.443 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [11:57:54] (CR) Kevin Bazira: [C: +1] "LGTM" [deployment-charts] - https://gerrit.wikimedia.org/r/790293 (https://phabricator.wikimedia.org/T301766) (owner: AikoChou) [11:58:03] (CR) Jbond: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35144/console" [puppet] - https://gerrit.wikimedia.org/r/787538 (owner: Jbond) [11:58:15] !log mvernon@cumin1001 END
(PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-fe1011.eqiad.wmnet [11:58:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:43] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-fe1012.eqiad.wmnet [11:58:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:03:47] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-fe1012.eqiad.wmnet [12:03:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:28] (CR) Slyngshede: "This affects two classes, logster_alarm and toolforge::proxy and the varnishkafka::monitor::statsd module." [puppet] - https://gerrit.wikimedia.org/r/790325 (https://phabricator.wikimedia.org/T123456) (owner: Slyngshede) [12:10:41] (PS8) Jaime Nuche: scap: add new `scap` user to deployment hosts and scap targets [puppet] - https://gerrit.wikimedia.org/r/789146 (https://phabricator.wikimedia.org/T306991) [12:10:50] RECOVERY - Check systemd state on ms-fe1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:10:58] !log klausman@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ores2002.codfw.wmnet with reason: host reimage [12:11:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:27] !log klausman@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ores2002.codfw.wmnet with reason: host reimage [12:14:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:18] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:15:40]
(CR) Elukey: [C: +2] ml-services: update editquality and draftquality image [deployment-charts] - https://gerrit.wikimedia.org/r/790293 (https://phabricator.wikimedia.org/T301766) (owner: AikoChou) [12:15:55] PROBLEM - PHP7 rendering on mw1437 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [12:16:15] SRE, SRE-swift-storage: swift-account-stats failures on thanos-swift - https://phabricator.wikimedia.org/T307907 (fgiunchedi) [12:16:55] RECOVERY - PHP7 rendering on mw1437 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 5.963 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [12:18:03] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-fe2010.codfw.wmnet [12:18:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:12] !log depool thanos-fe1001 to test load theory wrt account-stats failures - T307907 [12:19:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:16] T307907: swift-account-stats failures on thanos-swift - https://phabricator.wikimedia.org/T307907 [12:19:37] (PS1) Marostegui: Revert "db1172: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/790330 [12:19:42] (PS9) Jaime Nuche: scap: add new `scap` user to deployment hosts and scap targets [puppet] - https://gerrit.wikimedia.org/r/789146 (https://phabricator.wikimedia.org/T306991) [12:19:44] SRE, SRE-swift-storage, User-fgiunchedi: swift-account-stats failures on thanos-swift - https://phabricator.wikimedia.org/T307907 (fgiunchedi) [12:20:18] (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http -
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:20:31] (CR) Jaime Nuche: scap: add new `scap` user to deployment hosts and scap targets (3 comments) [puppet] - https://gerrit.wikimedia.org/r/789146 (https://phabricator.wikimedia.org/T306991) (owner: Jaime Nuche) [12:22:37] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-fe2010.codfw.wmnet [12:22:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:52] (PS6) Jaime Nuche: scap: add system package requirements for scap [puppet] - https://gerrit.wikimedia.org/r/789147 (https://phabricator.wikimedia.org/T306991) [12:22:58] (CR) Marostegui: [C: +2] Revert "db1172: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/790330 (owner: Marostegui) [12:27:04] (PS1) Marostegui: db1172: Migrate to mariadb 10.6 [puppet] - https://gerrit.wikimedia.org/r/790354 (https://phabricator.wikimedia.org/T307546) [12:27:13] (CR) Jaime Nuche: [C: -1] "https://phabricator.wikimedia.org/T307351 hasn't been completed yet" [puppet] - https://gerrit.wikimedia.org/r/789146 (https://phabricator.wikimedia.org/T306991) (owner: Jaime Nuche) [12:27:41] (CR) Marostegui: [C: +2] db1172: Migrate to mariadb 10.6 [puppet] - https://gerrit.wikimedia.org/r/790354 (https://phabricator.wikimedia.org/T307546) (owner: Marostegui) [12:28:38] (PS4) Jforrester: Disable LocalisationUpdate, part II [mediawiki-config] - https://gerrit.wikimedia.org/r/677326 (https://phabricator.wikimedia.org/T158360) [12:28:40] (PS5) Jaime Nuche: scap: clone scap code on deployment servers [puppet] - https://gerrit.wikimedia.org/r/789148 (https://phabricator.wikimedia.org/T306991) [12:28:46] (PS3) Jforrester: Disable LocalisationUpdate, part III [mediawiki-config] - https://gerrit.wikimedia.org/r/677327 (https://phabricator.wikimedia.org/T158360) [12:31:27] (PS1) Fomafix: Add additional aliases for sr-cyrl and
sr-latn next to sr-ec and sr-el [deployment-charts] - https://gerrit.wikimedia.org/r/790357 (https://phabricator.wikimedia.org/T117845) [12:31:48] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-fe2011.codfw.wmnet [12:31:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:49] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) firing: (2) Processing latency of WCQS_Streaming_Updater in eqiad (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [12:37:30] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-fe2011.codfw.wmnet [12:37:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:57] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-fe2012.codfw.wmnet [12:38:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:13] (KubernetesRsyslogDown) firing: rsyslog on ml-serve-ctrl1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [12:40:49] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) resolved: (2) Processing latency of WCQS_Streaming_Updater in eqiad (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [12:42:28] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-fe2012.codfw.wmnet [12:42:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:08] !log
klausman@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ores2002.codfw.wmnet with OS buster [12:45:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:35] !log klausman@deploy1002 Started deploy [ores/deploy@98a1b2e]: (no justification provided) [12:45:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:42] !log klausman@deploy1002 Finished deploy [ores/deploy@98a1b2e]: (no justification provided) (duration: 00m 07s) [12:45:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:06] !log installing perf updates on bullseye hosts [12:46:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:51] !log installing perf updates on stretch/buster hosts [12:54:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:45] (JobUnavailable) firing: (4) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:57:52] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:59:49] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti-test2002.codfw.wmnet [12:59:52] hello the room [12:59:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:05] RoanKattouw, Lucas_WMDE, and Urbanecm: #bothumor I � Unicode. All rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220509T1300). [13:00:05] erayfield, koi, kart_, and kostajh: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:14] hi [13:00:33] o/ [13:00:39] I can deploy if no-one else is around [13:00:42] hi [13:00:47] newbie here, so please be nice [13:00:49] I'm sort of around, but I'd rather not deploy [13:01:13] urbanecm: sure, I'll take care of it then! [13:01:27] erayfield: o/ sure, welcome! do you have the x-wikimedia-debug browser extension installed? [13:01:31] kart_: hey, around? [13:01:37] appreciated. but do feel free to ping me if I'm needed :) [13:01:53] no [13:02:07] taavi: yes. Sorry for the delay. [13:03:31] taavi: may be we can do +2 on the first patch of wmf.10? It will take approx 20 min for CI there. [13:03:54] erayfield: ok, could you install it please? it's needed to test any changes on a staging server before rolling them out everywhere. download links are on https://wikitech.wikimedia.org/wiki/WikimediaDebug#Browser_usage [13:04:20] kart_: sure, still looking at all the patches [13:06:00] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti-test2002.codfw.wmnet [13:06:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:23] (CR) Majavah: [C: +2] ULS entrypoint: Do not show current language, fix domain redirects [extensions/ContentTranslation] (wmf/1.39.0-wmf.10) - https://gerrit.wikimedia.org/r/789832 (https://phabricator.wikimedia.org/T307745) (owner: KartikMistry) [13:07:05] kart_: +2'd the first one, do you want to leave the second one to merge separately or should I +2 it too? [13:07:07] erayfield: if you need help getting set up with the browser extension (or other parts of this process) I'm happy to help [13:07:36] installed on chrome [13:07:52] erayfield: hmm, actually: is it possible to test your change on an individual server?
I am not familiar with how the MediaModeration extension works or when it generates log entries [13:08:40] we are changing the level for output to debug from warn, Taavi [13:08:57] will not use browser, command line [13:09:03] taavi: I'm not sure how to deploy both together as 2nd one is CX build. Let's deploy first one first :) [13:09:21] kart_: ack! [13:10:43] (CR) Jbond: "lgtm see inline for fyi" [puppet] - https://gerrit.wikimedia.org/r/789146 (https://phabricator.wikimedia.org/T306991) (owner: Jaime Nuche) [13:11:34] Taavi AHT or T&S Tools engineers will be running this script once a quarter - David Rochford will tell us when each run is needed. [13:11:35] Task where we communicate when the script has been run: https://phabricator.wikimedia.org/T258603 [13:11:35] Update this ticket with the final timestamp scanned. [13:11:36] Extension page: https://www.mediawiki.org/wiki/Extension:MediaModeration [13:11:36] Commands to run the script [13:11:37] Connect to maintenance server: [13:11:37] $ ssh mwmaint1002.eqiad.wmnet [13:11:38] OR [13:11:38] $ ssh mwmaint2002.codfw.wmnet [13:12:26] (PS9) JMeybohm: Add a cookbook for rolling reboot of k8s clusters [cookbooks] - https://gerrit.wikimedia.org/r/789680 (https://phabricator.wikimedia.org/T260661) [13:12:41] (CR) JMeybohm: Add a cookbook for rolling reboot of k8s clusters (10 comments) [cookbooks] - https://gerrit.wikimedia.org/r/789680 (https://phabricator.wikimedia.org/T260661) (owner: JMeybohm) [13:13:28] erayfield: ok, to make sure I've understood correctly: I can deploy the config change without testing it, and then when your team is running the script they'll see if it works or not? [13:13:50] yes, we will be moving back in about a week [13:13:55] ok, sounds good [13:14:02] thanks!
[13:14:21] (PS8) Majavah: Set log level to 'debug' for mediamoderation [mediawiki-config] - https://gerrit.wikimedia.org/r/788777 (https://phabricator.wikimedia.org/T303312) (owner: EllenR) [13:14:25] (CR) Majavah: [C: +2] Set log level to 'debug' for mediamoderation [mediawiki-config] - https://gerrit.wikimedia.org/r/788777 (https://phabricator.wikimedia.org/T303312) (owner: EllenR) [13:15:13] (Merged) jenkins-bot: Set log level to 'debug' for mediamoderation [mediawiki-config] - https://gerrit.wikimedia.org/r/788777 (https://phabricator.wikimedia.org/T303312) (owner: EllenR) [13:16:44] !log taavi@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:788777|Set log level to 'debug' for mediamoderation (T303312)]] (duration: 00m 50s) [13:16:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:50] T303312: Set log level to 'debug' for mediamoderation - https://phabricator.wikimedia.org/T303312 [13:17:14] erayfield: ok, that change is now live on the production cluster (and will automatically get deployed to the beta cluster in the next 30 mins or so) [13:17:21] anything else?
[13:17:41] we are good, thanks so much [13:17:43] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti-test2003.codfw.wmnet [13:17:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:47] happy to help [13:17:53] koi: yours are up next [13:18:13] (PS2) Majavah: ptwiki: Revoke 500KB uploading limitation [mediawiki-config] - https://gerrit.wikimedia.org/r/789889 (https://phabricator.wikimedia.org/T307813) (owner: Stang) [13:18:15] thanks again, have a great Monday [13:18:18] (CR) Majavah: [C: +2] ptwiki: Revoke 500KB uploading limitation [mediawiki-config] - https://gerrit.wikimedia.org/r/789889 (https://phabricator.wikimedia.org/T307813) (owner: Stang) [13:18:43] (PS1) Jelto: gitlab_runner: move metrics listen_address to global section [puppet] - https://gerrit.wikimedia.org/r/790369 (https://phabricator.wikimedia.org/T295481) [13:19:04] (Merged) jenkins-bot: ptwiki: Revoke 500KB uploading limitation [mediawiki-config] - https://gerrit.wikimedia.org/r/789889 (https://phabricator.wikimedia.org/T307813) (owner: Stang) [13:19:48] kostajh: can you test the first one on mwdebug1001? [13:19:57] sorry, that was for koi not kostajh [13:20:08] looking [13:20:32] lgtm [13:21:01] syncing [13:21:15] SRE, Data-Catalog, Data-Engineering, serviceops, and 2 others: New Service Request: DataHub - https://phabricator.wikimedia.org/T303049 (Ottomata) > For the duration of the upgrade plus some safety time windows before/after the upgrade, traffic will always be served from codfw and thus have the w...
[13:21:29] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:21:32] (CR) Jelto: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35145/console" [puppet] - https://gerrit.wikimedia.org/r/790369 (https://phabricator.wikimedia.org/T295481) (owner: Jelto) [13:21:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:48] !log taavi@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:789889|ptwiki: Revoke 500KB uploading limitation (T307813)]] (duration: 00m 50s) [13:21:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:52] T307813: Revoke 500KB limitation for uploading videos non-free-content on pt.wikipedia - https://phabricator.wikimedia.org/T307813 [13:22:00] taavi: Can I throw in an extra? Sorry, lost track of the time. [13:22:14] James_F: sure, add it to the calendar [13:22:39] (PS4) Majavah: rowiki: Fix canonical namespaces [mediawiki-config] - https://gerrit.wikimedia.org/r/789974 (https://phabricator.wikimedia.org/T127607) (owner: Stang) [13:22:42] (CR) Majavah: [C: +2] rowiki: Fix canonical namespaces [mediawiki-config] - https://gerrit.wikimedia.org/r/789974 (https://phabricator.wikimedia.org/T127607) (owner: Stang) [13:23:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti-test2003.codfw.wmnet [13:23:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:38] (Merged) jenkins-bot: rowiki: Fix canonical namespaces [mediawiki-config] - https://gerrit.wikimedia.org/r/789974 (https://phabricator.wikimedia.org/T127607) (owner: Stang) [13:23:43] taavi: Done: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/787829 [13:24:06] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:24:07] !log mwdebug-deploy@deploy1002
helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:24:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:14] (PS3) Jforrester: Set special footer licence message for MediaWiki.org re. Help: pages [mediawiki-config] - https://gerrit.wikimedia.org/r/787829 (https://phabricator.wikimedia.org/T301483) [13:24:26] koi: pulled the second one to mwdebug1001, can you test that too? [13:24:27] (CR) Ottomata: [C: +1] Increase the connect_timeout for eventgate based services (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/790289 (owner: Btullis) [13:24:33] James_F: thanks! [13:24:36] yeah, looking [13:24:47] (Merged) jenkins-bot: ULS entrypoint: Do not show current language, fix domain redirects [extensions/ContentTranslation] (wmf/1.39.0-wmf.10) - https://gerrit.wikimedia.org/r/789832 (https://phabricator.wikimedia.org/T307745) (owner: KartikMistry) [13:24:56] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:24:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:21] taavi: wmf.10 patch merged just in time :D [13:25:33] (CR) Ottomata: [C: +1] Increase the connect_timeout for eventgate based services (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/790289 (owner: Btullis) [13:25:58] SRE, Data-Engineering, Data-Engineering-Kanban, Traffic, Patch-For-Review: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (Ottomata) If that fixes it, at the very least it will give us some information about what is... [13:26:20] We can see if there is a time to deploy 2nd patch after kostajh and James_F's patches. Else, tomorrow is fine.
[13:26:41] jouncebot: next [13:26:42] In 2 hour(s) and 3 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220509T1530) [13:26:56] !log failover ganeti master in codfw/test to ganeti-test2003 [13:26:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:26] kart_: I don't think it's a big problem if we need to go a few mins overtime [13:27:34] taavi, lgtm [13:27:43] taavi: cool. Thanks! [13:28:03] koi: ack, syncing [13:28:34] kart_: yours is next, do you want to self deploy or want me to deploy? [13:28:42] (03CR) 10Jdrewniak: [C: 03+2] search-redirect.php: Make sure the family is lowercased [mediawiki-config] - 10https://gerrit.wikimedia.org/r/789972 (https://phabricator.wikimedia.org/T304629) (owner: 10Ladsgroup) [13:28:57] umh [13:29:06] !log taavi@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:789974|rowiki: Fix canonical namespaces (T127607)]] (duration: 00m 51s) [13:29:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:11] T127607: Fix canonical namespaces for rowiki - https://phabricator.wikimedia.org/T127607 [13:29:13] is Jdrewniak here? [13:29:22] taavi: Is the portal update merged but not deployed? [13:29:27] (03Merged) 10jenkins-bot: search-redirect.php: Make sure the family is lowercased [mediawiki-config] - 10https://gerrit.wikimedia.org/r/789972 (https://phabricator.wikimedia.org/T304629) (owner: 10Ladsgroup) [13:29:36] taavi: If so, don't worry about it. [13:29:36] no, see wikibugs above [13:29:49] aka https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/789972/ being merged in the middle of this window without any coordination [13:29:54] Oh, meh. 
[13:29:58] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:30:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:29] (03CR) 10Jforrester: "Why was this production config patch merged in the middle of the on-going deployment window?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/789972 (https://phabricator.wikimedia.org/T304629) (owner: 10Ladsgroup) [13:31:12] taavi: Can you please deploy, I'll focus on testing part today. [13:31:29] taavi: Yeah, just sync it anyway (or I can). Sorry about this. :-( [13:31:29] jouncebot: nowandnext [13:31:29] For the next 0 hour(s) and 28 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220509T1300) [13:31:30] In 1 hour(s) and 58 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220509T1530) [13:31:36] kart_: sure! [13:31:52] (03CR) 10Ssingh: [C: 03+2] icinga: Grant BCornwall host service command privs [puppet] - 10https://gerrit.wikimedia.org/r/789881 (owner: 10BCornwall) [13:32:12] PROBLEM - ganeti-wconfd running on ganeti-test2001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 115 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [13:32:21] (03PS1) 10Jdrewniak: merged by accident before a scheduled deploy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790335 [13:32:41] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:32:41] (03CR) 10Jforrester: [C: 04-1] "No, it's fine; just be more careful in future! 
:-)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790335 (owner: 10Jdrewniak) [13:32:42] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:32:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:02] kart_: pulled to mwdebug1001, can you test please? [13:33:32] Amir1: hello, do you want to test that search-redirect.php patch before I sync it? [13:33:33] (03CR) 10Ladsgroup: search-redirect.php: Make sure the family is lowercased (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/789972 (https://phabricator.wikimedia.org/T304629) (owner: 10Ladsgroup) [13:33:41] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:33:41] taavi: testing. [13:33:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:46] taavi: sure! [13:34:10] Amir1: ok, that too is now on mwdebug1001 [13:34:38] (03PS10) 10JMeybohm: Add a cookbook for rolling reboot of k8s clusters [cookbooks] - 10https://gerrit.wikimedia.org/r/789680 (https://phabricator.wikimedia.org/T260661) [13:35:20] taavi: works as expected. [13:35:29] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] "Parser test failures appear unrelated." [extensions/Kartographer] (wmf/1.39.0-wmf.10) - 10https://gerrit.wikimedia.org/r/790329 (https://phabricator.wikimedia.org/T307650) (owner: 10WMDE-Fisch) [13:35:36] (03CR) 10JMeybohm: Add a cookbook for rolling reboot of k8s clusters (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/789680 (https://phabricator.wikimedia.org/T260661) (owner: 10JMeybohm) [13:35:45] James_F taavi sorry for the accidental +2, I was planning to deploy it after the window [13:36:02] (03CR) 10Jdrewniak: [C: 03+2] "I shouldn't look at Gerrit before I've had a coffee... 
my mistake" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/789972 (https://phabricator.wikimedia.org/T304629) (owner: 10Ladsgroup) [13:36:02] taavi: All good. Please deploy :) [13:36:11] Amir1: no worries [13:36:15] Amir1: It happens. [13:36:19] !log taavi@deploy1002 Synchronized docroot/wwwportal/w/search-redirect.php: Config: [[gerrit:789972|search-redirect.php: Make sure the family is lowercased (T304629)]] (duration: 00m 51s) [13:36:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:24] T304629: Prepare the updated www.wiktionary.org page for deployment - https://phabricator.wikimedia.org/T304629 [13:36:27] (03PS10) 10Jaime Nuche: scap: add new `scap` user to deployment hosts and scap targets [puppet] - 10https://gerrit.wikimedia.org/r/789146 (https://phabricator.wikimedia.org/T306991) [13:36:31] Will Amir1 get sticker(s)? [13:36:39] (03Abandoned) 10Ladsgroup: merged by accident before a scheduled deploy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790335 (owner: 10Jdrewniak) [13:36:55] kart_: you should have seen today morning [13:37:01] :) [13:37:15] !log taavi@deploy1002 Synchronized php-1.39.0-wmf.10/extensions/ContentTranslation/modules/entrypoints: Backport: [[gerrit:789832|ULS entrypoint: Do not show current language, fix domain redirects (T307745 T298032)]] (duration: 00m 50s) [13:37:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:20] T298032: Surface missing languages in current mobile language selector to access Section Translation - https://phabricator.wikimedia.org/T298032 [13:37:20] T307745: Current language shown in missing languages banner - https://phabricator.wikimedia.org/T307745 [13:37:26] kart_: so I'll +2 the other patch and we'll get to deploying it after the other config patches are done? [13:37:36] taavi: yes. Sure! 
[13:37:37] (03PS13) 10Majavah: Newcomer tasks: deploy AND topic selection to pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/780874 (https://phabricator.wikimedia.org/T305399) (owner: 10Sergio Gimeno) [13:37:41] (03CR) 10Majavah: [C: 03+2] CX3 Build 0.2.0+20220509 [extensions/ContentTranslation] (wmf/1.39.0-wmf.10) - 10https://gerrit.wikimedia.org/r/790328 (https://phabricator.wikimedia.org/T306643) (owner: 10KartikMistry) [13:38:22] kostajh: we are finally at your patch! do you want to self-deploy or want me to deploy? [13:38:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:38:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:57] (03CR) 10Jaime Nuche: [C: 04-1] scap: add new `scap` user to deployment hosts and scap targets (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/789146 (https://phabricator.wikimedia.org/T306991) (owner: 10Jaime Nuche) [13:39:19] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti-test2001.codfw.wmnet [13:39:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:28] taavi: can you deploy please? 
[13:39:35] sure [13:39:38] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:39:40] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:39:40] ty [13:39:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:46] (03CR) 10Majavah: [C: 03+2] Newcomer tasks: deploy AND topic selection to pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/780874 (https://phabricator.wikimedia.org/T305399) (owner: 10Sergio Gimeno) [13:40:31] (03Merged) 10jenkins-bot: Newcomer tasks: deploy AND topic selection to pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/780874 (https://phabricator.wikimedia.org/T305399) (owner: 10Sergio Gimeno) [13:40:37] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:40:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:51] kostajh: pulled to mwdebug1001 [13:41:00] taavi: thanks, looking [13:41:48] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host centrallog2002.codfw.wmnet [13:41:50] !log filippo@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host centrallog2002.codfw.wmnet [13:41:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:31] taavi: bah, the patch that the config relies on isn't in wmf.10 :( IMO it's fine to leave this config patch enabled, and we can either do a backport for the relevant code (https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/780873) or wait for next week's train. do you have a preference? [13:45:36] sorry, I should have double checked this earlier. 
[13:45:36] (03CR) 10Btullis: "There's a discussion on #wikimedia-traffic about continuing to trace the eventgate processes to find out why the connect_timeout is being " [deployment-charts] - 10https://gerrit.wikimedia.org/r/790289 (owner: 10Btullis) [13:45:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:45:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:17] 10SRE, 10Infrastructure-Foundations, 10Mail: [Urgent] Google returning 503 error when delivering to mx1001 and mx2001 - https://phabricator.wikimedia.org/T307873 (10jhathaway) @jbond I can't think of any recent changes that would have introduced this behavior. The boxes were rebooted on Friday to catch the l... [13:46:44] kostajh: if you feel it's safe, it's probably easiest to just leave it enabled and wait for the train to roll forward [13:46:53] taavi: let's do that [13:46:55] thanks [13:46:58] ok, sure [13:47:03] 10SRE, 10Infrastructure-Foundations, 10Mail: [Urgent] Google returning 503 error when delivering to mx1001 and mx2001 - https://phabricator.wikimedia.org/T307873 (10herron) Looking at count of log lines matching "BDAT command used when CHUNKING not advertised" on mx1001 this appears to have began on the 5th,... 
[13:47:16] (03CR) 10Filippo Giunchedi: [C: 03+2] traffic: port LVS traffic/cpu alerts to Alertmanager [alerts] - 10https://gerrit.wikimedia.org/r/789094 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [13:47:20] (03PS3) 10Filippo Giunchedi: traffic: port LVS traffic/cpu alerts to Alertmanager [alerts] - 10https://gerrit.wikimedia.org/r/789094 (https://phabricator.wikimedia.org/T305847) [13:47:58] !log taavi@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:780874|Newcomer tasks: deploy AND topic selection to pilot wikis (T305399)]] (duration: 00m 49s) [13:48:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:04] T305399: Newcomer tasks: deploy AND selection to pilot wikis - https://phabricator.wikimedia.org/T305399 [13:48:19] James_F: yours is next, want to deploy that yourself or do you prefer that I deploy it? [13:48:27] (03PS1) 10Gergő Tisza: CampaignConfig: Avoid array_push() error [extensions/GrowthExperiments] (wmf/1.39.0-wmf.10) - 10https://gerrit.wikimedia.org/r/790336 [13:48:28] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:48:30] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:48:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:41] taavi: Please go ahead. [13:48:51] (03PS4) 10Majavah: Set special footer licence message for MediaWiki.org re. Help: pages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/787829 (https://phabricator.wikimedia.org/T301483) (owner: 10Jforrester) [13:48:55] (03CR) 10Majavah: [C: 03+2] Set special footer licence message for MediaWiki.org re. 
Help: pages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/787829 (https://phabricator.wikimedia.org/T301483) (owner: 10Jforrester) [13:49:08] 10SRE, 10Data-Engineering, 10Data-Engineering-Kanban, 10Traffic, 10Patch-For-Review: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (10BTullis) There's some discussion ongoing in [[https://wm-bot.wmflabs.org/libera_logs/%23wikim... [13:49:10] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host ganeti-test2001.codfw.wmnet [13:49:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:22] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:49:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:06] (03Merged) 10jenkins-bot: Set special footer licence message for MediaWiki.org re. Help: pages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/787829 (https://phabricator.wikimedia.org/T301483) (owner: 10Jforrester) [13:51:13] pulled to mwdebug1001, please test [13:52:20] !log jhathaway@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on mx1001.wikimedia.org with reason: old kernel :( [13:52:21] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on mx1001.wikimedia.org with reason: old kernel :( [13:52:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:29] (03CR) 10Filippo Giunchedi: [C: 03+2] lvs: remove rx/cpu alerts, moved to alerts.git [puppet] - 10https://gerrit.wikimedia.org/r/789144 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [13:52:59] taavi: Hmm. It looks wrong to me. [13:54:04] want me to revert? [13:54:17] Yeah, please. 
:-( [13:54:26] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:54:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:39] (03Merged) 10jenkins-bot: CX3 Build 0.2.0+20220509 [extensions/ContentTranslation] (wmf/1.39.0-wmf.10) - 10https://gerrit.wikimedia.org/r/790328 (https://phabricator.wikimedia.org/T306643) (owner: 10KartikMistry) [13:54:45] (03PS1) 10Majavah: Revert "Set special footer licence message for MediaWiki.org re. Help: pages" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790337 [13:54:53] (03CR) 10Majavah: [C: 03+2] ":(" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790337 (owner: 10Majavah) [13:55:42] kart_: in the meantime, pulled your second patch to mwdebug1001 [13:55:45] (03Merged) 10jenkins-bot: Revert "Set special footer licence message for MediaWiki.org re. Help: pages" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790337 (owner: 10Majavah) [13:56:12] !log jhathaway@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on mx2001.wikimedia.org with reason: old kernel :( [13:56:13] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on mx2001.wikimedia.org with reason: old kernel :( [13:56:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:27] taavi: OK. Testing. [13:56:29] taavi: Thanks! Will debug locally. 
[13:57:05] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:57:06] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:57:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:45] (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [13:57:56] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:58:02] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:58:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:53] (03PS1) 10Jelto: gitlab: separate hiera config for gitlab hosts [puppet] - 10https://gerrit.wikimedia.org/r/790371 (https://phabricator.wikimedia.org/T307142) [14:02:03] kart_: how is testing going? [14:03:04] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35146/console" [puppet] - 10https://gerrit.wikimedia.org/r/790371 (https://phabricator.wikimedia.org/T307142) (owner: 10Jelto) [14:03:07] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [14:03:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:24] taavi: sorry. Took long. Looks good. Please deploy. [14:03:39] syncing! 
[14:04:05] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [14:04:06] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [14:04:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:37] !log taavi@deploy1002 Synchronized php-1.39.0-wmf.10/extensions/ContentTranslation/app: Backport: [[gerrit:790328|CX3 Build 0.2.0+20220509 (T306643)]] (duration: 00m 51s) [14:04:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:43] T306643: Section Translation picking the wrong section when creating a new article - https://phabricator.wikimedia.org/T306643 [14:04:44] aaand that's live as well [14:04:58] Cool. Thanks a lot taavi! [14:05:00] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [14:05:02] hth [14:05:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:19] !log UTC afternoon backport window done [14:05:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:30] RECOVERY - Check systemd state on mx2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:13:09] (03CR) 10jerkins-bot: [V: 04-1] CampaignConfig: Avoid array_push() error [extensions/GrowthExperiments] (wmf/1.39.0-wmf.10) - 10https://gerrit.wikimedia.org/r/790336 (owner: 10Gergő Tisza) [14:15:05] 10SRE, 10SRE-swift-storage, 10User-fgiunchedi: swift-account-stats failures on thanos-swift - https://phabricator.wikimedia.org/T307907 (10MatthewVernon) FWIW, we do occasionally see this on ms-* too, but I can never repro on demand, which might support a load-related cause; I never found much in logs. Could... 
[14:27:13] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35148/console" [puppet] - 10https://gerrit.wikimedia.org/r/787538 (owner: 10Jbond) [14:30:22] (03CR) 10Thcipriani: [C: 03+1] Revert "No longer install subversion on Phabricator hosts" [puppet] - 10https://gerrit.wikimedia.org/r/789958 (https://phabricator.wikimedia.org/T307889) (owner: 10Hashar) [14:33:58] (KubernetesRsyslogDown) resolved: rsyslog on ml-serve-ctrl1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [14:37:56] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [14:40:35] (03PS2) 10Hnowlan: New service: image-suggestion [deployment-charts] - 10https://gerrit.wikimedia.org/r/789876 (https://phabricator.wikimedia.org/T304891) [14:43:00] (03CR) 10jerkins-bot: [V: 04-1] New service: image-suggestion [deployment-charts] - 10https://gerrit.wikimedia.org/r/789876 (https://phabricator.wikimedia.org/T304891) (owner: 10Hnowlan) [14:43:16] (CirrusSearchJVMGCOldPoolFlatlined) firing: Elasticsearch instance elastic1049-production-search-psi-eqiad is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCOldPoolFlatlined [14:52:30] (03PS4) 10Vgutierrez: mtail::cache_haproxy: Add HAProxy SLI counters [puppet] - 10https://gerrit.wikimedia.org/r/790298 (https://phabricator.wikimedia.org/T307898) [14:57:04] PROBLEM - Check systemd state on mx2001 is CRITICAL: CRITICAL - degraded: The following units failed: 
ifup@ens13.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:01:56] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [15:02:48] PROBLEM - SSH on analytics1061.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:03:39] 10SRE, 10Infrastructure-Foundations, 10Mail: [Urgent] Google returning 503 error when delivering to mx1001 and mx2001 - https://phabricator.wikimedia.org/T307873 (10herron) Hi @bcampbell, while SRE is investigating could ITS please open a case with the google postmasters about this issue as well? We have no... [15:03:44] 10SRE, 10Infrastructure-Foundations, 10Mail: [Urgent] Google returning 503 error when delivering to mx1001 and mx2001 - https://phabricator.wikimedia.org/T307873 (10jbond) demonstrating the we support chunking ` $ telnet -4 mx1001.wikimedia.org 25 Tryin... [15:04:34] (03PS1) 10David Caro: cloudvirt: redirect prometheus script errors to journal [puppet] - 10https://gerrit.wikimedia.org/r/790387 [15:04:36] (03PS1) 10David Caro: cloudvirt-libvirt-stats: Avoid printing to stdout [puppet] - 10https://gerrit.wikimedia.org/r/790388 [15:05:23] (03CR) 10jerkins-bot: [V: 04-1] cloudvirt: redirect prometheus script errors to journal [puppet] - 10https://gerrit.wikimedia.org/r/790387 (owner: 10David Caro) [15:09:08] 10SRE, 10Infrastructure-Foundations, 10Mail: [Urgent] Google returning 503 error when delivering to mx1001 and mx2001 - https://phabricator.wikimedia.org/T307873 (10bcampbell) Thank you @herron and @jbond I'll open a ticket with Google now and keep you updated. 
[15:11:36] 10SRE, 10Infrastructure-Foundations, 10Mail: [Urgent] Google returning 503 error when delivering to mx1001 and mx2001 - https://phabricator.wikimedia.org/T307873 (10jbond) My reading of https://seclists.org/oss-sec/2017/q4/324 suggests that if a BDAT command is issued after the mail or RCPT command then exim... [15:13:08] (03CR) 10Ssingh: [C: 03+1] "Thanks a lot for working on this and for the patch!" [cookbooks] - 10https://gerrit.wikimedia.org/r/790266 (owner: 10Muehlenhoff) [15:14:10] PROBLEM - SSH on wtp1037.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:15:10] (03PS2) 10David Caro: cloudvirt: redirect prometheus script errors to journal [puppet] - 10https://gerrit.wikimedia.org/r/790387 [15:15:12] (03PS2) 10David Caro: cloudvirt-libvirt-stats: Avoid printing to stdout [puppet] - 10https://gerrit.wikimedia.org/r/790388 [15:16:23] 10SRE, 10Infrastructure-Foundations, 10Mail: [Urgent] Google returning 503 error when delivering to mx1001 and mx2001 - https://phabricator.wikimedia.org/T307873 (10jbond) >>! In T307873#7914164, @jbond wrote: > My reading of https://seclists.org/oss-sec/2017/q4/324 suggests that if a BDAT command is issued... 
[15:16:36] (03CR) 10Muehlenhoff: sre.ganeti.reboot-vm: Add an option to skip waiting for successful Puppet run (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/790266 (owner: 10Muehlenhoff) [15:17:26] (03CR) 10Ssingh: [C: 03+1] sre.ganeti.reboot-vm: Add an option to skip waiting for successful Puppet run (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/790266 (owner: 10Muehlenhoff) [15:17:47] 10SRE, 10Infrastructure-Foundations, 10Mail: [Urgent] Google returning 503 error when delivering to mx1001 and mx2001 - https://phabricator.wikimedia.org/T307873 (10jhathaway) First messages in the logs appeared on May 4th: ` $ zgrep "BDAT command used when CHUNKING not advertised" /var/log/exim4/mainlog.5.... [15:18:07] (03PS1) 10Ayounsi: Netbox: clearly specify if on a dev server [puppet] - 10https://gerrit.wikimedia.org/r/790390 (https://phabricator.wikimedia.org/T296452) [15:18:44] (03CR) 10Ayounsi: "Not tested but I figured it would be helpful to not mess with prod." [puppet] - 10https://gerrit.wikimedia.org/r/790390 (https://phabricator.wikimedia.org/T296452) (owner: 10Ayounsi) [15:19:32] (03CR) 10Muehlenhoff: [C: 03+2] sre.ganeti.reboot-vm: Add an option to skip waiting for successful Puppet run [cookbooks] - 10https://gerrit.wikimedia.org/r/790266 (owner: 10Muehlenhoff) [15:21:09] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/790390 (https://phabricator.wikimedia.org/T296452) (owner: 10Ayounsi) [15:22:08] (03CR) 10Jbond: [V: 03+1 C: 03+2] C:docker_registry_ha::web: use puppetdb_query instead of query_facts [puppet] - 10https://gerrit.wikimedia.org/r/787538 (owner: 10Jbond) [15:22:10] (03CR) 10Ayounsi: [V: 03+1] "https://puppet-compiler.wmflabs.org/pcc-worker1001/35150/" [puppet] - 10https://gerrit.wikimedia.org/r/790390 (https://phabricator.wikimedia.org/T296452) (owner: 10Ayounsi) [15:22:21] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790392 
(https://phabricator.wikimedia.org/T128546) [15:22:36] (03CR) 10Vgutierrez: "Checking the current behavior of MediaWiki I see that right now MW emits a non-cacheable 404 for BadTitle. Tested against https://en.wikip" [puppet] - 10https://gerrit.wikimedia.org/r/769827 (owner: 10Legoktm) [15:23:29] (03PS2) 10Ayounsi: Netbox: clearly specify if on a dev server [puppet] - 10https://gerrit.wikimedia.org/r/790390 (https://phabricator.wikimedia.org/T296452) [15:24:35] (03CR) 10Ayounsi: [V: 03+1] "https://puppet-compiler.wmflabs.org/pcc-worker1002/35151/" [puppet] - 10https://gerrit.wikimedia.org/r/790390 (https://phabricator.wikimedia.org/T296452) (owner: 10Ayounsi) [15:25:18] (03CR) 10Ayounsi: [V: 03+1 C: 03+2] Netbox: clearly specify if on a dev server [puppet] - 10https://gerrit.wikimedia.org/r/790390 (https://phabricator.wikimedia.org/T296452) (owner: 10Ayounsi) [15:28:38] (03PS3) 10BCornwall: admin: Add user "brett" to ops group [puppet] - 10https://gerrit.wikimedia.org/r/789727 [15:30:05] jan_drewniak: I, the Bot under the Fountain, call upon thee, The Deployer, to do Wikimedia Portals Update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220509T1530). 
[15:30:34] (03CR) 10Bking: [Beta Cluster] LabsServices: Switch elastic hosts to bullseye hosts (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/789877 (https://phabricator.wikimedia.org/T299797) (owner: 10Bking) [15:31:30] (03CR) 10Jdrewniak: [C: 03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790392 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [15:31:43] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35152/console" [puppet] - 10https://gerrit.wikimedia.org/r/787541 (owner: 10Jbond) [15:32:19] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790392 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [15:33:31] (03PS1) 10BCornwall: Add Kwaku Addo Ofori to ops manager approval list [puppet] - 10https://gerrit.wikimedia.org/r/790393 [15:34:14] !log depool doh6001 (as part of T307427) [15:34:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:19] T307427: Migrate Ganeti installations in drmrs to a fixed KVM machine type - https://phabricator.wikimedia.org/T307427 [15:35:00] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:mariadb::proxy::multiinstance_replicas: drop use of query_facts [puppet] - 10https://gerrit.wikimedia.org/r/787541 (owner: 10Jbond) [15:35:30] !log sukhe@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM doh6001.wikimedia.org [15:35:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:33] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.ganeti.reboot-vm (exit_code=97) for VM doh6001.wikimedia.org [15:35:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:38] 10SRE, 10Infrastructure-Foundations: Migrate Ganeti installations in drmrs to a fixed KVM machine type - https://phabricator.wikimedia.org/T307427 (10ops-monitoring-bot) VM 
doh6001.wikimedia.org rebooted by sukhe@cumin2002 with reason: None [15:35:46] !log sukhe@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM doh6001.wikimedia.org [15:35:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:54] 10SRE, 10Infrastructure-Foundations: Migrate Ganeti installations in drmrs to a fixed KVM machine type - https://phabricator.wikimedia.org/T307427 (10ops-monitoring-bot) VM doh6001.wikimedia.org rebooted by sukhe@cumin2002 with reason: None [15:36:08] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [15:36:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:19] (03PS1) 10Jbond: Revert "P:mariadb::proxy::multiinstance_replicas: drop use of query_facts" [puppet] - 10https://gerrit.wikimedia.org/r/790338 [15:36:26] (03CR) 10Jbond: [V: 03+2 C: 03+2] Revert "P:mariadb::proxy::multiinstance_replicas: drop use of query_facts" [puppet] - 10https://gerrit.wikimedia.org/r/790338 (owner: 10Jbond) [15:37:09] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [15:37:10] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [15:37:11] (03PS1) 10Jbond: Revert "Revert "P:mariadb::proxy::multiinstance_replicas: drop use of query_facts"" [puppet] - 10https://gerrit.wikimedia.org/r/790339 [15:37:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:05] (03PS2) 10Jbond: P:mariadb::proxy::multiinstance_replicas: drop use of query_facts [puppet] - 10https://gerrit.wikimedia.org/r/790339 [15:38:08] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [15:38:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:09] (03PS3) 10Jbond: P:mariadb::proxy::multiinstance_replicas: drop use of query_facts [puppet] 
- 10https://gerrit.wikimedia.org/r/790339 [15:39:13] (03CR) 10BCornwall: [C: 03+2] admin: Add user "brett" to ops group [puppet] - 10https://gerrit.wikimedia.org/r/789727 (owner: 10BCornwall) [15:39:49] (03PS1) 10Jdrewniak: Revert "Bumping portals to master" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790340 [15:40:32] (03CR) 10Jdrewniak: [C: 03+2] Revert "Bumping portals to master" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790340 (owner: 10Jdrewniak) [15:40:36] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35153/console" [puppet] - 10https://gerrit.wikimedia.org/r/790339 (owner: 10Jbond) [15:41:04] PROBLEM - BFD status on asw1-b12-drmrs.mgmt is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:41:10] ^ wikidough [15:41:14] PROBLEM - BGP status on asw1-b12-drmrs.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv6: Connect - Anycast, AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:41:27] (03Merged) 10jenkins-bot: Revert "Bumping portals to master" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790340 (owner: 10Jdrewniak) [15:41:59] !log sukhe@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM doh6001.wikimedia.org [15:42:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:07] please don't merge the change on puppetmaster [15:42:13] there was some confusion, we will revert it [15:42:18] PROBLEM - Bird Internet Routing Daemon on doh6001 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [15:42:18] PROBLEM - Check if anycast-healthchecker and all configured threads are running on doh6001 is CRITICAL: CRITICAL: anycast-healthchecker could be down as pid file /var/run/anycast-healthchecker/anycast-healthchecker.pid doesnt exist 
https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running [15:43:17] (03PS1) 10Ssingh: Revert "admin: Add user "brett" to ops group" [puppet] - 10https://gerrit.wikimedia.org/r/790341 [15:43:24] (03PS2) 10BCornwall: Add Kwaku Addo Ofori to ops manager approval list [puppet] - 10https://gerrit.wikimedia.org/r/790393 [15:44:06] PROBLEM - Check systemd state on doh6001 is CRITICAL: CRITICAL - degraded: The following units failed: anycast-healthchecker.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:44:15] ^ looking, related to the restart [15:44:23] (03CR) 10Muehlenhoff: [C: 03+1] "Good catch, that approval list was assembled when the Traffic SRE team didn't have a dedicated engineering manager." [puppet] - 10https://gerrit.wikimedia.org/r/790393 (owner: 10BCornwall) [15:44:38] RECOVERY - Bird Internet Routing Daemon on doh6001 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [15:44:38] RECOVERY - Check if anycast-healthchecker and all configured threads are running on doh6001 is OK: OK: UP (pid=2925) and all threads (2) are running https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running [15:44:49] (03CR) 10Ssingh: [C: 03+2] Revert "admin: Add user "brett" to ops group" [puppet] - 10https://gerrit.wikimedia.org/r/790341 (owner: 10Ssingh) [15:45:38] RECOVERY - BFD status on asw1-b12-drmrs.mgmt is OK: OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:45:48] RECOVERY - BGP status on asw1-b12-drmrs.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:46:21] (03PS3) 10Hnowlan: New service: image-suggestion [deployment-charts] - 10https://gerrit.wikimedia.org/r/789876 (https://phabricator.wikimedia.org/T304891) [15:46:24] RECOVERY - Check systemd state on doh6001 is OK: OK - running: The system is fully 
operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:46:41] 10SRE, 10Infrastructure-Foundations, 10Mail: [Urgent] Google returning 503 error when delivering to mx1001 and mx2001 - https://phabricator.wikimedia.org/T307873 (10grin) My suggestions: - Please first **remove the google servers from the callout cache**, and you may also consider examining what caused call... [15:46:55] (NodeTextfileStale) firing: Stale textfile for cloudvirt1019:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [15:48:17] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [15:48:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:04] (03CR) 10jerkins-bot: [V: 04-1] New service: image-suggestion [deployment-charts] - 10https://gerrit.wikimedia.org/r/789876 (https://phabricator.wikimedia.org/T304891) (owner: 10Hnowlan) [15:49:19] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [15:49:21] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [15:49:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:58] (03PS4) 10Jbond: P:mariadb::proxy::multiinstance_replicas: drop use of query_facts [puppet] - 10https://gerrit.wikimedia.org/r/790339 [15:50:18] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [15:50:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:10] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35154/console" [puppet] - 10https://gerrit.wikimedia.org/r/790339 (owner: 
10Jbond) [15:51:45] (03PS1) 10DLynch: Release DiscussionTools new topic tool to former a/b test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790395 (https://phabricator.wikimedia.org/T307410) [15:54:10] 10SRE, 10Infrastructure-Foundations: Migrate Ganeti installations in drmrs to a fixed KVM machine type - https://phabricator.wikimedia.org/T307427 (10ssingh) [15:55:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [15:57:31] 10SRE, 10Infrastructure-Foundations, 10Mail: [Urgent] Google returning 503 error when delivering to mx1001 and mx2001 - https://phabricator.wikimedia.org/T307873 (10herron) p:05High→03Medium `'chunking_advertise_hosts ='` (disabling chunking) has been applied to both MXes and we have not seen this error... [16:00:42] PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:00:59] 10SRE, 10Data-Engineering, 10Data-Engineering-Kanban, 10Traffic, 10Patch-For-Review: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (10akosiaris) Looking at https://grafana-rw.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&viewPane... [16:01:37] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: sre.ganeti.reboot-vm cookbook should re-enable Puppet if it was disabled - https://phabricator.wikimedia.org/T307792 (10ssingh) Thanks Moritz for the patch: https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/790266. I tried this with doh6001 for T3... 
[16:02:47] !log sukhe@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM doh6002.wikimedia.org [16:02:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:53] 10SRE, 10Infrastructure-Foundations: Migrate Ganeti installations in drmrs to a fixed KVM machine type - https://phabricator.wikimedia.org/T307427 (10ops-monitoring-bot) VM doh6002.wikimedia.org rebooted by sukhe@cumin2002 with reason: None [16:03:02] !log depool doh6002 (as part of T307427) [16:03:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:06] T307427: Migrate Ganeti installations in drmrs to a fixed KVM machine type - https://phabricator.wikimedia.org/T307427 [16:04:04] RECOVERY - SSH on analytics1061.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:05:15] (03PS15) 10Jbond: P:wmcs::nfs::maintain_dbusers: replace query_nodes with puppetdb_query [puppet] - 10https://gerrit.wikimedia.org/r/787485 [16:06:18] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35155/console" [puppet] - 10https://gerrit.wikimedia.org/r/787485 (owner: 10Jbond) [16:07:02] !log restart elasticsearch_6@production-search-psi-eqiad on elastic1049 to resolve CirrusSearchJVMGCOldPoolFlatlined [16:07:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:01] (CirrusSearchJVMGCOldPoolFlatlined) resolved: Elasticsearch instance elastic1049-production-search-psi-eqiad is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCOldPoolFlatlined [16:08:20] (03PS16) 10Jbond: P:wmcs::nfs::maintain_dbusers: replace query_nodes with puppetdb_query [puppet] - 10https://gerrit.wikimedia.org/r/787485 [16:08:45] 
(03PS1) 10Lucas Werkmeister (WMDE): Configure wgLexemeLexicalCategoryItemIds on Beta Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790398 (https://phabricator.wikimedia.org/T298150) [16:09:08] (03PS5) 10Jbond: P:mariadb::proxy::multiinstance_replicas: drop use of query_facts [puppet] - 10https://gerrit.wikimedia.org/r/790339 [16:10:04] (03CR) 10D3r1ck01: [C: 03+1] "Looks good." [software/spicerack] - 10https://gerrit.wikimedia.org/r/789923 (https://phabricator.wikimedia.org/T201491) (owner: 10Klein Muçi) [16:10:11] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35156/console" [puppet] - 10https://gerrit.wikimedia.org/r/790339 (owner: 10Jbond) [16:10:17] !log razzi@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-airflow1003.eqiad.wmnet [16:10:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:42] !log sukhe@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM doh6002.wikimedia.org [16:11:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:51] (03PS6) 10Jbond: P:mariadb::proxy::multiinstance_replicas: drop use of query_facts [puppet] - 10https://gerrit.wikimedia.org/r/790339 [16:12:55] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35157/console" [puppet] - 10https://gerrit.wikimedia.org/r/790339 (owner: 10Jbond) [16:13:59] !log razzi@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-airflow1003.eqiad.wmnet [16:14:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:48] (03PS7) 10Jbond: P:mariadb::proxy::multiinstance_replicas: drop use of query_facts [puppet] - 10https://gerrit.wikimedia.org/r/790339 [16:15:20] RECOVERY - SSH on wtp1037.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:15:30] (03PS1) 10Lucas Werkmeister (WMDE): Configure wgLexemeLexicalCategoryItemIds on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790399 (https://phabricator.wikimedia.org/T307441) [16:15:50] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35158/console" [puppet] - 10https://gerrit.wikimedia.org/r/790339 (owner: 10Jbond) [16:15:56] (03CR) 10Lucas Werkmeister (WMDE): [C: 04-2] "Just parking this here for now, not to be deployed quite yet." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790399 (https://phabricator.wikimedia.org/T307441) (owner: 10Lucas Werkmeister (WMDE)) [16:20:29] (03PS8) 10Jbond: P:mariadb::proxy::multiinstance_replicas: drop use of query_facts [puppet] - 10https://gerrit.wikimedia.org/r/790339 [16:21:48] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35159/console" [puppet] - 10https://gerrit.wikimedia.org/r/790339 (owner: 10Jbond) [16:22:48] 10SRE, 10serviceops: Provide node14 and node16 images for running production node-based services - https://phabricator.wikimedia.org/T306996 (10bd808) >>! In T306996#7912881, @Joe wrote: > I've build and published the `nodejs14-slim` and the `nodejs16-slim` images, using the nodejs package from the components.... 
[16:23:51] (03PS1) 10Ayounsi: Netbox: Add 2.11 configuration knobs [puppet] - 10https://gerrit.wikimedia.org/r/790400 (https://phabricator.wikimedia.org/T296452) [16:24:42] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:mariadb::proxy::multiinstance_replicas: drop use of query_facts [puppet] - 10https://gerrit.wikimedia.org/r/790339 (owner: 10Jbond) [16:29:33] (03CR) 10Ayounsi: [V: 03+1] "https://puppet-compiler.wmflabs.org/pcc-worker1003/35160/" [puppet] - 10https://gerrit.wikimedia.org/r/790400 (https://phabricator.wikimedia.org/T296452) (owner: 10Ayounsi) [16:31:08] (03PS1) 10Ladsgroup: wwwportals: Make sure portal assets are also visible in wiktionary vhost [puppet] - 10https://gerrit.wikimedia.org/r/790402 (https://phabricator.wikimedia.org/T304629) [16:33:50] jouncebot: nowandnext [16:33:50] No deployments scheduled for the next 0 hour(s) and 26 minute(s) [16:33:50] In 0 hour(s) and 26 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220509T1700) [16:33:56] (03PS1) 10Stang: Remove upload rights on wikis where local uploads are disabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790403 (https://phabricator.wikimedia.org/T143789) [16:34:07] I go mess with mwdebug to test https://gerrit.wikimedia.org/r/790402 [16:35:28] (03CR) 10jerkins-bot: [V: 04-1] Remove upload rights on wikis where local uploads are disabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790403 (https://phabricator.wikimedia.org/T143789) (owner: 10Stang) [16:36:39] (03PS2) 10Ladsgroup: wwwportals: Make sure portal assets are also visible in wiktionary vhost [puppet] - 10https://gerrit.wikimedia.org/r/790402 (https://phabricator.wikimedia.org/T304629) [16:37:04] (03CR) 10Ladsgroup: [C: 03+2] "Tested in mwdebug1002, works as expected." 
[puppet] - 10https://gerrit.wikimedia.org/r/790402 (https://phabricator.wikimedia.org/T304629) (owner: 10Ladsgroup) [16:37:55] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] wwwportals: Make sure portal assets are also visible in wiktionary vhost [puppet] - 10https://gerrit.wikimedia.org/r/790402 (https://phabricator.wikimedia.org/T304629) (owner: 10Ladsgroup) [16:39:07] (03PS1) 10Hnowlan: tegola: reduce number of replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/790405 (https://phabricator.wikimedia.org/T307757) [16:40:58] 10SRE, 10Infrastructure-Foundations, 10Mail: [Urgent] Google returning 503 error when delivering to mx1001 and mx2001 - https://phabricator.wikimedia.org/T307873 (10bcampbell) I heard back from SADA, our Google vendor. "Hope you're doing well! We are not currently aware of any changes to how Google would be... [16:41:54] check out https://www.wiktionary.org/ in mwdebug1002 ^^ [16:42:57] /ac/wg 2 [16:43:54] (03CR) 10Jdrewniak: [C: 03+1] wwwportals: Make sure portal assets are also visible in wiktionary vhost [puppet] - 10https://gerrit.wikimedia.org/r/790402 (https://phabricator.wikimedia.org/T304629) (owner: 10Ladsgroup) [16:46:29] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [16:46:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:16] (03CR) 10Hnowlan: [C: 03+2] ci: Provide basic `.pipeline/config.yaml` [software/tegola] (wmf/v0.14.x) - 10https://gerrit.wikimedia.org/r/789636 (https://phabricator.wikimedia.org/T307507) (owner: 10Dduvall) [16:49:24] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:49:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:51:05] (03Merged) 10jenkins-bot: ci: Provide basic `.pipeline/config.yaml` [software/tegola] (wmf/v0.14.x) - 10https://gerrit.wikimedia.org/r/789636 (https://phabricator.wikimedia.org/T307507) (owner: 10Dduvall) [16:52:01] (03PS1) 10Sergio Gimeno: Newcomer tasks: deploy AND topic selection 
to pilot wikis [extensions/GrowthExperiments] (wmf/1.39.0-wmf.10) - 10https://gerrit.wikimedia.org/r/790406 (https://phabricator.wikimedia.org/T305399) [16:54:04] 10SRE, 10ops-codfw: Dell switches testing - https://phabricator.wikimedia.org/T290133 (10Papaul) Removed all Dell switches from Netbox [16:55:27] 10SRE, 10ops-codfw: Dell switches testing - https://phabricator.wikimedia.org/T290133 (10Papaul) [16:55:43] 10SRE, 10ops-codfw: Dell switches testing - https://phabricator.wikimedia.org/T290133 (10Papaul) [16:56:00] (JobUnavailable) firing: (3) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:57:54] 10SRE, 10ops-codfw: Dell switches testing - https://phabricator.wikimedia.org/T290133 (10Papaul) 05Open→03Resolved The hardware testing is complete please see in the description for links on all the testing results. The hardware has been dropped off at shipping today for return [16:58:36] (03PS1) 10Ayounsi: Update submodule and requirements for 2.11.12 [software/netbox-deploy] (2-11-12) - 10https://gerrit.wikimedia.org/r/790407 (https://phabricator.wikimedia.org/T296452) [17:00:05] ryankemper: That opportune time is upon us again. Time for a Wikidata Query Service weekly deploy deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220509T1700). [17:01:52] RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:04:31] 10SRE, 10ops-eqiad, 10serviceops: mw1415 (canary appserver) is down, incl. 
mgmt - https://phabricator.wikimedia.org/T307755 (10Dzahn) [17:04:36] (03PS4) 10Hnowlan: service: add image-suggestion ingress service [puppet] - 10https://gerrit.wikimedia.org/r/788753 (https://phabricator.wikimedia.org/T304891) [17:05:13] jouncebot: nowandnext [17:05:13] For the next 0 hour(s) and 24 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220509T1700) [17:05:13] In 2 hour(s) and 54 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220509T2000) [17:05:38] ryankemper: are you deploying wdqs today? If not, I'm gonna steal your window [17:06:13] Amir1: you're good, feel free to proceed [17:06:35] Thanks! [17:10:28] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [17:10:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:10:43] 10SRE, 10ops-codfw: Dell switches testing: Setup mgmt for two servers for testing - https://phabricator.wikimedia.org/T305070 (10Papaul) [17:10:49] (03PS2) 10Stang: Remove upload rights on wikis where local uploads are disabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790403 (https://phabricator.wikimedia.org/T143789) [17:11:33] 10SRE, 10ops-codfw: Dell switches testing: Setup mgmt for two servers for testing - https://phabricator.wikimedia.org/T305070 (10Papaul) 05Open→03Resolved complete [17:11:36] 10SRE, 10ops-codfw: Dell switches testing - https://phabricator.wikimedia.org/T290133 (10Papaul) [17:11:48] (03CR) 10jerkins-bot: [V: 04-1] Remove upload rights on wikis where local uploads are disabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790403 (https://phabricator.wikimedia.org/T143789) (owner: 10Stang) [17:11:51] (03PS3) 10Stang: Remove upload rights on wikis where local uploads are disabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790403 (https://phabricator.wikimedia.org/T143789) [17:11:59] 10SRE, 10ops-eqiad, 10serviceops: mw1415 (canary 
appserver) is down, incl. mgmt - https://phabricator.wikimedia.org/T307755 (10Dzahn) @Cmjohnson or @Jclark-ctr This server just went down, server itself AND mgmt at the same time. So we can't add much here. But it's only been purchased in 2021.So that should... [17:12:52] (03CR) 10jerkins-bot: [V: 04-1] Remove upload rights on wikis where local uploads are disabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790403 (https://phabricator.wikimedia.org/T143789) (owner: 10Stang) [17:13:33] (03PS4) 10Stang: Remove upload rights on wikis where local uploads are disabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790403 (https://phabricator.wikimedia.org/T143789) [17:13:55] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790345 [17:14:09] 10SRE, 10Infrastructure-Foundations, 10Mail: [mitigated] Google returning 503 error when delivering to mx1001 and mx2001 - https://phabricator.wikimedia.org/T307873 (10herron) [17:14:33] (03CR) 10Ladsgroup: [C: 03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790345 (owner: 10Jdrewniak) [17:14:35] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:14:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:05] Amir1: OK if I merge a Beta Cluster config patch or two? They don't need a deploy, just a pull. [17:15:27] (03PS5) 10Stang: Remove upload rights on wikis where local uploads are disabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790403 (https://phabricator.wikimedia.org/T143789) [17:15:28] James_F: sure, I can rebase it too [17:15:33] 10SRE, 10ops-eqiad, 10serviceops: mw1415 (canary appserver) is down, incl. mgmt - https://phabricator.wikimedia.org/T307755 (10Dzahn) IPMI from remote also fails: Error: Unable to establish IPMI v2 / RMCP+ session [17:15:48] 10SRE, 10ops-eqiad, 10serviceops: mw1415 (canary appserver) is down, incl. 
mgmt - https://phabricator.wikimedia.org/T307755 (10Dzahn) p:05Triage→03Medium [17:15:55] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790345 (owner: 10Jdrewniak) [17:16:09] Amir1: Well, if you insist. ;-) https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/789877 and https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/766602 and maybe https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/747494 I think? [17:16:20] niiiice [17:17:40] (03CR) 10Ladsgroup: [C: 03+2] [Beta Cluster] LabsServices: Switch elastic hosts to bullseye hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/789877 (https://phabricator.wikimedia.org/T299797) (owner: 10Bking) [17:18:27] 10SRE, 10SRE-swift-storage, 10User-fgiunchedi: swift-account-stats failures on thanos-swift - https://phabricator.wikimedia.org/T307907 (10Dzahn) We have already been debugging this a bit. When manually running the command I always get stats back. Even in a for-loop with 5 second sleep I could not reprodu... 
[17:18:30] (03CR) 10Majavah: [Beta Cluster] LabsServices: Use deployment-graphite01 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747494 (https://phabricator.wikimedia.org/T241285) (owner: 10Majavah) [17:19:04] (03Merged) 10jenkins-bot: [Beta Cluster] LabsServices: Switch elastic hosts to bullseye hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/789877 (https://phabricator.wikimedia.org/T299797) (owner: 10Bking) [17:19:32] (03CR) 10jerkins-bot: [V: 04-1] Newcomer tasks: deploy AND topic selection to pilot wikis [extensions/GrowthExperiments] (wmf/1.39.0-wmf.10) - 10https://gerrit.wikimedia.org/r/790406 (https://phabricator.wikimedia.org/T305399) (owner: 10Sergio Gimeno) [17:19:40] (03CR) 10Bartosz Dziewoński: [C: 03+1] Release DiscussionTools new topic tool to former a/b test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790395 (https://phabricator.wikimedia.org/T307410) (owner: 10DLynch) [17:19:53] 10SRE, 10SRE-swift-storage, 10User-fgiunchedi: swift-account-stats failures on thanos-swift - https://phabricator.wikimedia.org/T307907 (10Dzahn) P.S. These are still crons and not systemd timers but T273673 says we should convert them all. So maybe this is a good time to do that. Then as a side effect we c... [17:20:19] taavi: I foolishly took someone C+1-ing the patch as a sign it probably was good to deploy. I'm such a fool. 
;-) [17:20:30] hehe [17:20:31] (03PS4) 10Ladsgroup: [Beta Cluster] LabsServices: Move to buster restbase host [mediawiki-config] - 10https://gerrit.wikimedia.org/r/766602 (https://phabricator.wikimedia.org/T295375) (owner: 10Hnowlan) [17:20:37] (03CR) 10Ladsgroup: [C: 03+2] [Beta Cluster] LabsServices: Move to buster restbase host [mediawiki-config] - 10https://gerrit.wikimedia.org/r/766602 (https://phabricator.wikimedia.org/T295375) (owner: 10Hnowlan) [17:21:24] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [17:21:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:48] (03Merged) 10jenkins-bot: [Beta Cluster] LabsServices: Move to buster restbase host [mediawiki-config] - 10https://gerrit.wikimedia.org/r/766602 (https://phabricator.wikimedia.org/T295375) (owner: 10Hnowlan) [17:22:00] 10SRE, 10Infrastructure-Foundations, 10Prod-Kubernetes, 10netops: Agree strategy for Kubernetes BGP peering to top-of-rack switches - https://phabricator.wikimedia.org/T306649 (10JMeybohm) [17:22:03] !log ladsgroup@deploy1002 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:790345|Bumping portals to master (T304629)]] (duration: 00m 52s) [17:22:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:08] T304629: Deploy updated www.wiktionary.org page - https://phabricator.wikimedia.org/T304629 [17:22:21] James_F: and sorry for not noticing your comment earlier :/ [17:22:24] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [17:22:26] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [17:22:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:54] !log ladsgroup@deploy1002 Synchronized portals: Wikimedia Portals Update: [[gerrit:790345|Bumping portals to master 
(T304629)]] (duration: 00m 50s) [17:22:56] taavi: No worries. On the same track via https://gerrit.wikimedia.org/r/q/file:wmf-config/LabsServices.php+is:open is https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/684012 from you good to go? [17:22:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:15] that one should be fine [17:23:22] Anyway, I've got to run. [17:23:24] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [17:23:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:50] (03CR) 10Stang: "Ref: Ie08c4cd9e1243df5cecab4a409d623f8d71caffe" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790403 (https://phabricator.wikimedia.org/T143789) (owner: 10Stang) [17:24:07] (03CR) 10Dzahn: "I guess it depends on your definition of "used". I deactivated all remaining subversion repos and told Phabricator explicitly to ignore th" [puppet] - 10https://gerrit.wikimedia.org/r/789958 (https://phabricator.wikimedia.org/T307889) (owner: 10Hashar) [17:24:49] (03PS1) 10Ebernhardson: cirrus: Enable DeprecationLoggedHttps [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790408 (https://phabricator.wikimedia.org/T218994) [17:28:28] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [17:28:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:21] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/pcc-worker1002/35161/" [puppet] - 10https://gerrit.wikimedia.org/r/790371 (https://phabricator.wikimedia.org/T307142) (owner: 10Jelto) [17:29:25] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [17:29:26] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [17:29:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:31] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:22] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [17:30:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:32:48] (03PS1) 10Clare Ming: Adjust table of contents margins at 1000-1200 breakpoint [skins/Vector] (wmf/1.39.0-wmf.10) - 10https://gerrit.wikimedia.org/r/790426 (https://phabricator.wikimedia.org/T307004) [17:32:51] (03CR) 10Dzahn: [C: 03+2] "noop confirmed" [puppet] - 10https://gerrit.wikimedia.org/r/790371 (https://phabricator.wikimedia.org/T307142) (owner: 10Jelto) [17:33:58] (03CR) 10Nray: [C: 03+1] Adjust table of contents margins at 1000-1200 breakpoint [skins/Vector] (wmf/1.39.0-wmf.10) - 10https://gerrit.wikimedia.org/r/790426 (https://phabricator.wikimedia.org/T307004) (owner: 10Clare Ming) [17:34:57] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/pcc-worker1003/35162/gitlab-runner1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/790369 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [17:57:46] (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [18:05:31] !log etherpad - maintenance reboot - expect a short downtime [18:05:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:06:09] !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on etherpad1003.eqiad.wmnet with reason: reboot [18:06:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:06:12] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on etherpad1003.eqiad.wmnet with reason: reboot [18:06:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:01] !log dzahn@cumin2002 START - 
Cookbook sre.hosts.downtime for 1:00:00 on mwmaint2002.codfw.wmnet with reason: reboot [18:13:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:04] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on mwmaint2002.codfw.wmnet with reason: reboot [18:13:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:11] !log rebooting mwmaint2002 (not active maint server) [18:13:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:27] (03CR) 10Sergio Gimeno: [C: 04-1] "Cannot be backported as it is because the two files depend on each other" [extensions/GrowthExperiments] (wmf/1.39.0-wmf.10) - 10https://gerrit.wikimedia.org/r/790406 (https://phabricator.wikimedia.org/T305399) (owner: 10Sergio Gimeno) [18:15:01] 10SRE, 10SRE-Access-Requests, 10Community-Tech: Grant Access to PII in Superset for HMonroy and Dmaza - https://phabricator.wikimedia.org/T307737 (10RLazarus) a:05jhathaway→03None Grabbing this from @jhathaway as I've taken over SRE clinic duty for this week. This is actually the right template for the u... 
[18:17:43] (03PS2) 10Krinkle: mediawiki: Remove route for /static/current/* (rewrite_static_assets) [deployment-charts] - 10https://gerrit.wikimedia.org/r/778601 (https://phabricator.wikimedia.org/T302465) [18:17:49] (03PS2) 10Krinkle: mediawiki: Remove unused rewrite_static_assets param [puppet] - 10https://gerrit.wikimedia.org/r/778602 (https://phabricator.wikimedia.org/T302465) [18:17:55] (03PS2) 10Krinkle: varnish: Expand static.php optimisation regarless of query string [puppet] - 10https://gerrit.wikimedia.org/r/777904 (https://phabricator.wikimedia.org/T302465) [18:19:30] PROBLEM - SSH on wtp1025.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:19:52] (03CR) 10WMDE-Fisch: "recheck" [extensions/Kartographer] (wmf/1.39.0-wmf.10) - 10https://gerrit.wikimedia.org/r/790329 (https://phabricator.wikimedia.org/T307650) (owner: 10WMDE-Fisch) [18:37:56] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [18:38:11] (03CR) 10Andrew Bogott: [V: 03+2 C: 03+2] striker: add fake hiera secrets [labs/private] - 10https://gerrit.wikimedia.org/r/790013 (https://phabricator.wikimedia.org/T306469) (owner: 10BryanDavis) [18:40:20] 10SRE, 10SRE-Access-Requests, 10Community-Tech: Grant Access to PII in Superset for HMonroy and Dmaza - https://phabricator.wikimedia.org/T307737 (10Ottomata) Approved. 
[18:46:35] (03CR) 10Hashar: Revert "No longer install subversion on Phabricator hosts" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/789958 (https://phabricator.wikimedia.org/T307889) (owner: 10Hashar) [18:48:08] (03PS2) 10Sergio Gimeno: Newcomer tasks: deploy AND topic selection to pilot wikis [extensions/GrowthExperiments] (wmf/1.39.0-wmf.10) - 10https://gerrit.wikimedia.org/r/790406 (https://phabricator.wikimedia.org/T305399) [18:48:34] 10SRE, 10SRE-swift-storage, 10User-fgiunchedi: swift-account-stats failures on thanos-swift - https://phabricator.wikimedia.org/T307907 (10Zabe) >>! In T307907#7914831, @Dzahn wrote: > P.S. These are still crons and not systemd timers but T273673 says we should convert them all. So maybe this is a good time... [18:49:48] PROBLEM - SSH on wtp1048.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:50:03] (03PS3) 10Sergio Gimeno: Newcomer tasks: deploy AND topic selection to pilot wikis [extensions/GrowthExperiments] (wmf/1.39.0-wmf.10) - 10https://gerrit.wikimedia.org/r/790406 (https://phabricator.wikimedia.org/T305399) [18:51:12] (03CR) 10Andrew Bogott: "I have not digested this but pcc results are here:" [puppet] - 10https://gerrit.wikimedia.org/r/790012 (https://phabricator.wikimedia.org/T306469) (owner: 10BryanDavis) [18:51:33] 10SRE, 10SRE-Access-Requests, 10Scap: Add new user identity to Keyholder for scap - https://phabricator.wikimedia.org/T307351 (10thcipriani) Tagging this with #sre-access-requests since (a) it's **sorta** an access request and (b) I'm not sure about the normal process for adding new identities for keyholder... 
[18:52:14] RECOVERY - Check systemd state on mx2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:57:48] (03PS1) 10RLazarus: admin: Add hmonroy, dmaza to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/790414 (https://phabricator.wikimedia.org/T307737) [18:59:37] (03CR) 10BryanDavis: striker: Add profile to provision docker container (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/790012 (https://phabricator.wikimedia.org/T306469) (owner: 10BryanDavis) [19:00:14] (03CR) 10Ssingh: [C: 03+1] admin: Add hmonroy, dmaza to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/790414 (https://phabricator.wikimedia.org/T307737) (owner: 10RLazarus) [19:01:09] 10SRE, 10Infrastructure-Foundations: Migrate Ganeti installations in drmrs to a fixed KVM machine type - https://phabricator.wikimedia.org/T307427 (10ssingh) [19:01:56] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [19:02:07] (03PS1) 10Ebernhardson: changeprop: Update beta cluster domain names to .cloud [deployment-charts] - 10https://gerrit.wikimedia.org/r/790416 (https://phabricator.wikimedia.org/T307862) [19:04:23] (03CR) 10RLazarus: [C: 03+2] admin: Add hmonroy, dmaza to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/790414 (https://phabricator.wikimedia.org/T307737) (owner: 10RLazarus) [19:04:37] !log depool durum6001.drmrs.wmnet (as part of T307427) [19:04:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:04:42] T307427: Migrate Ganeti installations in drmrs to a fixed KVM machine type - https://phabricator.wikimedia.org/T307427 [19:04:45] 10SRE, 10LDAP-Access-Requests, 10User-zeljkofilipin: +2 
for esther-akinloose in Gerrit (mediawiki/extensions/VisualEditor) - https://phabricator.wikimedia.org/T305373 (10RLazarus) Just picking up SRE clinic duty for the week -- I'm so sorry this has been sitting for so long! I'll ask around and try to find o... [19:06:15] !log sukhe@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM durum6001.drmrs.wmnet [19:06:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:40] 10SRE, 10Infrastructure-Foundations: Migrate Ganeti installations in drmrs to a fixed KVM machine type - https://phabricator.wikimedia.org/T307427 (10ops-monitoring-bot) VM durum6001.drmrs.wmnet rebooted by sukhe@cumin2002 with reason: None [19:07:56] PROBLEM - SSH on analytics1061.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:11:37] !log sukhe@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM durum6001.drmrs.wmnet [19:11:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:11:49] (03CR) 10jerkins-bot: [V: 04-1] Newcomer tasks: deploy AND topic selection to pilot wikis [extensions/GrowthExperiments] (wmf/1.39.0-wmf.10) - 10https://gerrit.wikimedia.org/r/790406 (https://phabricator.wikimedia.org/T305399) (owner: 10Sergio Gimeno) [19:16:25] 10SRE, 10SRE-Access-Requests, 10Community-Tech, 10Patch-For-Review: Grant Access to PII in Superset for HMonroy and Dmaza - https://phabricator.wikimedia.org/T307737 (10RLazarus) 05Open→03Resolved a:03RLazarus Thanks! You're both in the `wmf` group already, so nothing to do there: ` rzl@mwmaint1002... 
[19:17:06] !log depool durum6002.drmrs.wmnet (as part of T307427) [19:17:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:11] T307427: Migrate Ganeti installations in drmrs to a fixed KVM machine type - https://phabricator.wikimedia.org/T307427 [19:17:36] !log sukhe@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM durum6002.drmrs.wmnet [19:17:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:45] 10SRE, 10Infrastructure-Foundations: Migrate Ganeti installations in drmrs to a fixed KVM machine type - https://phabricator.wikimedia.org/T307427 (10ops-monitoring-bot) VM durum6002.drmrs.wmnet rebooted by sukhe@cumin2002 with reason: None [19:24:22] 10SRE, 10SRE Observability: Reminders for unhandled/unacked alerts - https://phabricator.wikimedia.org/T307958 (10herron) p:05Triage→03Medium [19:25:00] !log sukhe@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM durum6002.drmrs.wmnet [19:25:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:20] 10SRE, 10Infrastructure-Foundations: Migrate Ganeti installations in drmrs to a fixed KVM machine type - https://phabricator.wikimedia.org/T307427 (10ssingh) [19:26:30] 10SRE, 10SRE Observability: Reminders for unhandled/unacked alerts - https://phabricator.wikimedia.org/T307958 (10herron) This idea arose in an irc convo while looking into stale icinga alerts on the unhandled dashboard, @Dzahn please add/adjust/edit anything I missed! 
[19:46:55] (NodeTextfileStale) firing: Stale textfile for cloudvirt1019:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [19:47:00] (03PS1) 10JHathaway: mx: disable chunking [puppet] - 10https://gerrit.wikimedia.org/r/790419 (https://phabricator.wikimedia.org/T307873) [19:50:24] (03CR) 10JHathaway: "kindly review" [puppet] - 10https://gerrit.wikimedia.org/r/790419 (https://phabricator.wikimedia.org/T307873) (owner: 10JHathaway) [19:50:36] RECOVERY - Check systemd state on stat1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:52:27] (03CR) 10WMDE-Fisch: [C: 03+1] "CI errors really seem unrelated and also appear on other patches on 1.39.0-wmf.10." [extensions/Kartographer] (wmf/1.39.0-wmf.10) - 10https://gerrit.wikimedia.org/r/790329 (https://phabricator.wikimedia.org/T307650) (owner: 10WMDE-Fisch) [19:52:48] RECOVERY - DNS on logstash2028.mgmt is OK: DNS OK: 0.013 seconds response time. logstash2028.mgmt.codfw.wmnet returns 10.193.1.93 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:55:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [20:00:04] RoanKattouw, Urbanecm, and cjming: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220509T2000). [20:00:04] WMDE-Fisch, ebernhardson, cjming, and tgr: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:34] i can deploy [20:00:42] \o [20:01:08] Please go for it, it's a bit late for me and I rather concentrate on checking the result :-) [20:01:41] The patch currently -2 on CI but it's pretty much unrelated. [20:01:56] Seems to be a general issue on .10 atm. [20:02:08] (03CR) 10Clare Ming: [C: 03+2] Refresh MediaWiki globals when loading mapdata [extensions/Kartographer] (wmf/1.39.0-wmf.10) - 10https://gerrit.wikimedia.org/r/790329 (https://phabricator.wikimedia.org/T307650) (owner: 10WMDE-Fisch) [20:03:03] (03CR) 10Clare Ming: [C: 03+2] cirrus: Enable DeprecationLoggedHttps [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790408 (https://phabricator.wikimedia.org/T218994) (owner: 10Ebernhardson) [20:04:00] (03Merged) 10jenkins-bot: cirrus: Enable DeprecationLoggedHttps [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790408 (https://phabricator.wikimedia.org/T218994) (owner: 10Ebernhardson) [20:04:52] o/ [20:06:57] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:07:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:07:56] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:07:57] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:07:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:54] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:08:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:09:45] WMDE-Fisch: I went ahead and +2'd your patch - presumably it will still merge even with that one failure? [20:10:01] CI ^^ [20:10:45] cjming: No I guess not ... :-/ [20:11:46] Needs to be submitted and V+2ed manually ... 
but I guess when multiple sources confirm that the CI errors are unrelated this is acceptable [20:13:00] also if anyone happens to know -- I perhaps prematurely +2'd ebernhardson's config patch and just noticed a bunch of config patches have been stuck in the merge queue for hours (2-15 hrs) -- I'm not sure what the right course of action is here [20:13:39] RoanKattouw: or urbanecm: if you're around, can you advise? [20:13:44] same for the GrowthExperiments patches btw, will need to be force-merged [20:14:18] tgr: yeah I was also looking at that one to confirm :-) [20:14:54] it's a known bug, some kind of circular dependency between parsoid and core parser tests [20:15:15] Oh yikes [20:15:19] cjming: the config patch did merge [20:15:34] tgr: so for all wmf.10 backports, we need to force merge? [20:16:17] not sure about all repos. All GrowthExperiments wmf.10 patches are affected for sure. [20:16:56] hashar looked into it last week, he might understand better what's happening [20:17:14] but in general the parser test error about images can be ignored [20:18:10] I think it only affects repos which have VisualEditor as a CI dependency [20:18:28] (03CR) 10jerkins-bot: [V: 04-1] Refresh MediaWiki globals when loading mapdata [extensions/Kartographer] (wmf/1.39.0-wmf.10) - 10https://gerrit.wikimedia.org/r/790329 (https://phabricator.wikimedia.org/T307650) (owner: 10WMDE-Fisch) [20:19:11] operations/mediawiki-config [20:19:40] ...looks OK to me, everything with +2 is merged. Maybe that was a different merge queue? [20:20:06] tgr: sorry i meant the postmerge queue [20:21:08] there's like more than a dozen patches in that postmerge queue [20:21:28] cjming: So I just double checked the gate and submit test results on my .10 patch. Looks all good. [20:21:33] ah, OK. That means the beta cluster is not updating again. Not the end of the world. 
[20:21:56] RECOVERY - SSH on wtp1025.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:21:56] let me file a task about it [20:22:03] thanks tgr [20:22:37] 10SRE, 10SRE-Access-Requests, 10Infrastructure-Foundations, 10serviceops, 10Release-Engineering-Team (Radar): Need a service account on deploy servers for automated train pre-sync operations - https://phabricator.wikimedia.org/T303857 (10thcipriani) >>! In T303857#7901981, @Joe wrote: > To give some cont... [20:22:55] WMDE-Fisch: ok so I guess i have to force merge your patch -- not having done that before, I'm not quite sure of the steps [20:23:30] I can bumble through it but if someone here has done it, would love some confirmation that i'm doing it right [20:23:40] (03PS6) 10Stang: Remove upload rights on wikis where local uploads are disabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790403 (https://phabricator.wikimedia.org/T143789) [20:23:56] cjming: I think you have to remove the V -1 from jenkins and put a manual V: +2 there. [20:24:12] Then a submit button should be visible in the gerrit UI [20:24:24] When you hit that it should be merged [20:24:40] (03CR) 10Clare Ming: [V: 03+2 C: 03+2] Refresh MediaWiki globals when loading mapdata [extensions/Kartographer] (wmf/1.39.0-wmf.10) - 10https://gerrit.wikimedia.org/r/790329 (https://phabricator.wikimedia.org/T307650) (owner: 10WMDE-Fisch) [20:24:56] (03PS7) 10Stang: Remove upload rights on wikis where local uploads are disabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790403 (https://phabricator.wikimedia.org/T143789) [20:25:23] WMDE-Fisch: thanks - i removed jenkins V-1 and hit the `Verified +2` button [20:25:26] And then you can just continue deployment as if the CI has submitted. 
[20:25:43] Upper left corner in the UI is now a `Submit` link :-) [20:25:47] *right [20:25:49] (filed as T307963) [20:25:50] T307963: Config patches are stuck in the postmerge queue - https://phabricator.wikimedia.org/T307963 [20:26:27] WMDE-Fisch: got it - thanks! I thought I had to force merge cmd line on the deployment server -- this is more straightforward [20:26:37] :-) [20:28:30] WMDE-Fisch: is your patch something that can be tested? it's on mwdebug1001 [20:28:46] cjming: I'll have a quick look. [20:29:18] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:29:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:30:01] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:30:02] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:30:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:30:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:30:18] ebernhardson: i forgot to check if you're here - merged your patch, once I sync the Kartographer patch, I'll sync yours [20:30:43] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:30:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:30:47] cjming: All good go on! [20:30:53] tgr: thanks for filing ticket [20:31:02] WMDE-Fisch: ok - syncing! [20:32:25] !log cjming@deploy1002 Synchronized php-1.39.0-wmf.10/extensions/Kartographer/modules/box: Backport: [[gerrit:790329|Refresh MediaWiki globals when loading mapdata (T307650)]] (duration: 00m 52s) [20:32:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:30] T307650: [versioned maps] Interactive map fails to render after save - https://phabricator.wikimedia.org/T307650 [20:32:30] WMDE-Fisch: should be live [20:32:42] cjming: Thanks! 
[20:34:12] ebernhardson: your patch is on mwdebug1001 -- if you're here, lmk if you can test -- otherwise I'll go ahead and sync since it seems pretty benign to me [20:35:54] (03CR) 10Clare Ming: [C: 03+2] Adjust table of contents margins at 1000-1200 breakpoint [skins/Vector] (wmf/1.39.0-wmf.10) - 10https://gerrit.wikimedia.org/r/790426 (https://phabricator.wikimedia.org/T307004) (owner: 10Clare Ming) [20:36:15] the CI issue with ParserIntegrationTests and wmf/1.39.0-wmf.10 is more or less a known issue :/ [20:36:25] !log cjming@deploy1002 Synchronized wmf-config/ProductionServices.php: Config: [[gerrit:790408|cirrus: Enable DeprecationLoggedHttps (T218994)]] (duration: 00m 51s) [20:36:27] ebernhardson: fyi, sync'd your patch - should be live [20:36:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:36:30] T218994: Deprecation warning on elasticsearch 6 - https://phabricator.wikimedia.org/T218994 [20:37:23] tgr: just merging my patch to wmf.10 -- do you want to self-serve? also happy to do yours if you'd prefer [20:37:42] works for me either way [20:37:49] I tried to get it fixed by creating a wmf/1.39.0-wmf.10 branch in mediawiki/services/parsoid but apparently that did not fix it or timedMediaHandlerParserTests.txt is broken in some other case. Anyway the Parsoid team is aware of it [20:38:12] hashar: thanks for info - gtk [20:38:48] it is a mismatch with the parsoid version, some extensions having their parser tests targeting master [20:39:10] so essentially it is probably ignorable, but I promise we will find a solution [20:43:50] tgr: I'm assuming my and your 2 patches have to be merged sequentially, not in parallel? 
[20:44:20] * hashar sleeps [20:44:43] cjming: they are unrelated but probably better to do one after the other, just in case one needs to be reverted [20:47:05] good point [20:47:06] The other day I tried to merge two patches in parallel to save time, the one that merged first broke the canaries, and I found out I didn't know how to revert multiple submodule commits simultaneously. That was unpleasant. [20:47:27] an important cautionary tale [20:48:08] The upside from CI being broken is that merging doesn't take 40 minutes per patch. [20:49:16] lol [20:49:25] (just realized why they're called canaries :( ) [20:49:56] concerned mine (bec Vector) will take 20+ minutes to merge based on the last few deployments I did [20:50:30] at least they can be reverted to the previous revision if something goes wrong. Real canaries would probably envy that. [20:51:45] tgr: bright side - you now know how to revert multiple submodule commits simultaneously? that would be good info to document somewhere if it isn't already [20:52:16] RECOVERY - SSH on wtp1048.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:54:15] 10SRE, 10Infrastructure-Foundations, 10Mail, 10Patch-For-Review: [mitigated] Google returning 503 error when delivering to mx1001 and mx2001 - https://phabricator.wikimedia.org/T307873 (10jhathaway) >>! In T307873#7914363, @grin wrote: > - Please first **remove the google servers from the callout cache**,... [20:55:46] I still don't know what git command would have worked. urbane.cm's solution was to just press the revert button in gerrit and then pull the changes. Slightly slower than doing stuff in git, but simple. 
[20:56:00] (JobUnavailable) firing: (3) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:57:40] (03Merged) 10jenkins-bot: Adjust table of contents margins at 1000-1200 breakpoint [skins/Vector] (wmf/1.39.0-wmf.10) - 10https://gerrit.wikimedia.org/r/790426 (https://phabricator.wikimedia.org/T307004) (owner: 10Clare Ming) [21:00:04] Reedy, sbassett, Maryum, and manfredi: (Dis)respected human, time to deploy Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220509T2100). Please do the needful. [21:01:06] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:01:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:01:54] tgr: ok just finished syncing my patch -- i guess we have to force merge your 1st one? 
[21:02:06] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:02:07] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:02:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:02:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:02:16] !log cjming@deploy1002 Synchronized php-1.39.0-wmf.10/skins/Vector/resources: Backport: [[gerrit:790426|Adjust table of contents margins at 1000-1200 breakpoint (T307004)]] (duration: 00m 53s) [21:02:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:02:20] T307004: Fix spacing/margins for screens between 1000 and 1200 px - https://phabricator.wikimedia.org/T307004 [21:02:46] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:02:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:03:12] RECOVERY - Check systemd state on otrs1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:05:04] (03CR) 10Clare Ming: [C: 03+2] CampaignConfig: Avoid array_push() error [extensions/GrowthExperiments] (wmf/1.39.0-wmf.10) - 10https://gerrit.wikimedia.org/r/790336 (owner: 10Gergő Tisza) [21:05:32] cjming: yeah [21:06:12] do i have to wait for CI -1 from jenkins or can i just go ahead and click "Verified+2"? [21:08:04] no, you can just immediately +2/+2 and then submit. 
[21:08:10] (03CR) 10Clare Ming: [V: 03+2 C: 03+2] CampaignConfig: Avoid array_push() error [extensions/GrowthExperiments] (wmf/1.39.0-wmf.10) - 10https://gerrit.wikimedia.org/r/790336 (owner: 10Gergő Tisza) [21:09:38] RECOVERY - SSH on analytics1061.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:09:44] alrighty then - tgr: is your 1st patch something that can be tested on mwdebug1001? [21:10:20] or should I just go ahead and sync? [21:10:25] I can at least check nothing is broken [21:10:35] that's always a good call [21:12:39] lgtm - shall it be live? [21:12:50] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:12:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:13:24] cjming: looks good, thanks [21:13:30] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:13:31] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:13:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:13:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:13:42] tgr: ok syncing now [21:14:16] tgr: i guess same process with your 2nd patch - does it need a rebase? 
[21:14:18] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:14:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:14:26] !log cjming@deploy1002 Synchronized php-1.39.0-wmf.10/extensions/GrowthExperiments/includes/NewcomerTasks/CampaignConfig.php: Backport: [[gerrit:790336|CampaignConfig: Avoid array_push() error]] (duration: 00m 51s) [21:14:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:14:35] tgr: your 1st patch is live [21:14:41] no, that's specific to operations/mediawiki-config [21:14:49] most other repos rebase automatically [21:14:59] gtk [21:15:10] (03CR) 10Clare Ming: [C: 03+2] Newcomer tasks: deploy AND topic selection to pilot wikis [extensions/GrowthExperiments] (wmf/1.39.0-wmf.10) - 10https://gerrit.wikimedia.org/r/790406 (https://phabricator.wikimedia.org/T305399) (owner: 10Sergio Gimeno) [21:15:14] (03CR) 10Clare Ming: [V: 03+2 C: 03+2] Newcomer tasks: deploy AND topic selection to pilot wikis [extensions/GrowthExperiments] (wmf/1.39.0-wmf.10) - 10https://gerrit.wikimedia.org/r/790406 (https://phabricator.wikimedia.org/T305399) (owner: 10Sergio Gimeno) [21:17:03] tgr: your 2nd patch is up on mwdebug1001 - lgtm - can you confirm? [21:17:44] works, thanks! [21:17:55] cool - syncing then [21:18:57] !log cjming@deploy1002 Synchronized php-1.39.0-wmf.10/extensions/GrowthExperiments: Backport: [[gerrit:790406|Newcomer tasks: deploy AND topic selection to pilot wikis (T305399)]] (duration: 00m 54s) [21:19:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:19:01] T305399: Newcomer tasks: deploy AND selection to pilot wikis - https://phabricator.wikimedia.org/T305399 [21:19:05] tgr: 2nd patch is live [21:19:24] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:19:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:19:30] thanks for the deploys! 
[21:19:36] np - thanks for your help! [21:19:40] !log end of UTC late backport & config window [21:19:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:20:15] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:20:16] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:20:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:20:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:21:02] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:21:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:27:40] PROBLEM - SSH on wtp1045.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:46:24] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/790419 (https://phabricator.wikimedia.org/T307873) (owner: 10JHathaway) [21:47:31] (03PS1) 10Jforrester: deployment-prep: Drop deployment-restbase03, no longer to be used [puppet] - 10https://gerrit.wikimedia.org/r/790424 (https://phabricator.wikimedia.org/T306052) [21:49:23] (03PS1) 10Jforrester: changeprop: Switch Beta Cluster RESTbase target server to restbase04 [deployment-charts] - 10https://gerrit.wikimedia.org/r/790425 (https://phabricator.wikimedia.org/T306052) [21:53:25] (03CR) 10JHathaway: [C: 03+2] mx: disable chunking [puppet] - 10https://gerrit.wikimedia.org/r/790419 (https://phabricator.wikimedia.org/T307873) (owner: 10JHathaway) [21:54:56] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (10Jclark-ctr) @nskaggs Can you confirm these racks still work for this task and nothing has changed? before i continue to cable these. Thanks... 
[21:55:01] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (10Jclark-ctr) [21:56:13] !log jhathaway@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on mx1001.wikimedia.org with reason: new kernel, round deux [21:56:14] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on mx1001.wikimedia.org with reason: new kernel, round deux [21:56:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:56:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:57:46] (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [21:58:22] !log jhathaway@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on mx2001.wikimedia.org with reason: new kernel round deux [21:58:24] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on mx2001.wikimedia.org with reason: new kernel round deux [21:58:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:58:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:03:01] (03PS1) 10Jforrester: api: Add support for linksmigration in ApiQueryLinks [core] (wmf/1.39.0-wmf.10) - 10https://gerrit.wikimedia.org/r/790427 (https://phabricator.wikimedia.org/T304780) [22:06:08] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:08:41] (03PS2) 10Jforrester: [Beta Cluster] LabsServices: Use https for swift [mediawiki-config] - 10https://gerrit.wikimedia.org/r/684012 (https://phabricator.wikimedia.org/T277990) (owner: 10Majavah) [22:08:51] (03CR) 
10jerkins-bot: [V: 04-1] [Beta Cluster] LabsServices: Use https for swift [mediawiki-config] - 10https://gerrit.wikimedia.org/r/684012 (https://phabricator.wikimedia.org/T277990) (owner: 10Majavah) [22:10:11] (03PS3) 10Jforrester: [Beta Cluster] LabsServices: Use https for swift [mediawiki-config] - 10https://gerrit.wikimedia.org/r/684012 (https://phabricator.wikimedia.org/T277990) (owner: 10Majavah) [22:12:39] (03CR) 10Dzahn: [C: 03+2] "I will revert it but i think we should actually delete the repos then." [puppet] - 10https://gerrit.wikimedia.org/r/789958 (https://phabricator.wikimedia.org/T307889) (owner: 10Hashar) [22:15:46] (03CR) 10JHathaway: [C: 03+1] "I really like this refactor, nice work!" [puppet] - 10https://gerrit.wikimedia.org/r/790294 (owner: 10Jbond) [22:19:21] (03CR) 10JHathaway: [C: 03+1] "looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/786310 (https://phabricator.wikimedia.org/T67270) (owner: 10Jbond) [22:20:05] (03CR) 10JHathaway: [C: 03+1] rake: Add new rake task to convert a module to SPDX [puppet] - 10https://gerrit.wikimedia.org/r/789790 (owner: 10Jbond) [22:22:33] (03PS2) 10BryanDavis: striker: Add profile to provision docker container [puppet] - 10https://gerrit.wikimedia.org/r/790012 (https://phabricator.wikimedia.org/T306469) [22:28:54] RECOVERY - SSH on wtp1045.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:30:12] (03PS2) 10Zabe: icinga: migrate sync-icinga-state cron to systemd timer job [puppet] - 10https://gerrit.wikimedia.org/r/780671 (https://phabricator.wikimedia.org/T273673) [22:30:18] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:35:18] (ProbeDown) resolved: 
Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:37:56] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [22:39:14] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [22:43:50] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [22:48:41] (CR) BryanDavis: [V: -1 C: -1] "Still not quite sure what is causing this PCC error:" [puppet] - https://gerrit.wikimedia.org/r/790012 (https://phabricator.wikimedia.org/T306469) (owner: BryanDavis) [23:01:56] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [23:07:18] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:46:55] (NodeTextfileStale) firing: Stale textfile for cloudvirt1019:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - 
https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [23:55:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale