[00:10:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P42926 and previous config saved to /var/cache/conftool/dbconfig/20230106-001049-ladsgroup.json
[00:18:58] PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: hardsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:25:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P42927 and previous config saved to /var/cache/conftool/dbconfig/20230106-002556-ladsgroup.json
[00:27:25] (PS1) Andrew Bogott: Neutron: enable linuxbridge for Zed [puppet] - https://gerrit.wikimedia.org/r/876033 (https://phabricator.wikimedia.org/T323086)
[00:28:37] (CR) Andrew Bogott: [C: +2] Neutron: enable linuxbridge for Zed [puppet] - https://gerrit.wikimedia.org/r/876033 (https://phabricator.wikimedia.org/T323086) (owner: Andrew Bogott)
[00:29:58] (KubernetesAPILatency) firing: High Kubernetes API latency (UPDATE certificaterequests) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[00:30:14] RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:34:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (UPDATE certificaterequests) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[00:41:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T326156)', diff saved to
https://phabricator.wikimedia.org/P42928 and previous config saved to /var/cache/conftool/dbconfig/20230106-004102-ladsgroup.json
[00:41:06] T326156: Fix CreditsSource drifts - https://phabricator.wikimedia.org/T326156
[00:46:42] PROBLEM - Check systemd state on logstash2026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:52:27] jouncebot: nowandnext
[00:52:27] No deployments scheduled for the next 6 hour(s) and 7 minute(s)
[00:52:27] In 6 hour(s) and 7 minute(s): MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230106T0700)
[00:55:04] (PS1) Urbanecm: Revert "GlobalRename: Convert DB selects to use SelectQueryBuilder" [extensions/CentralAuth] (wmf/1.40.0-wmf.17) - https://gerrit.wikimedia.org/r/876051 (https://phabricator.wikimedia.org/T326377)
[00:55:35] (CR) Urbanecm: [C: +2] "backporting; making Special:GlobalRenameProgress work again" [extensions/CentralAuth] (wmf/1.40.0-wmf.17) - https://gerrit.wikimedia.org/r/876051 (https://phabricator.wikimedia.org/T326377) (owner: Urbanecm)
[00:58:19] (Merged) jenkins-bot: Revert "GlobalRename: Convert DB selects to use SelectQueryBuilder" [extensions/CentralAuth] (wmf/1.40.0-wmf.17) - https://gerrit.wikimedia.org/r/876051 (https://phabricator.wikimedia.org/T326377) (owner: Urbanecm)
[00:59:14] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:876051|Revert "GlobalRename: Convert DB selects to use SelectQueryBuilder" (T326377 T312394)]]
[00:59:21] T326377: Special:GlobalRenameProgress fails with "Wikimedia\Rdbms\DBQueryError: Error 1146: Table 'metawiki.renameuser_status' doesn't exist" - https://phabricator.wikimedia.org/T326377
[00:59:22] T312394: Migrate usage of Database::select to SelectQueryBuilder in CentralAuth - https://phabricator.wikimedia.org/T312394
[01:01:01] !log urbanecm@deploy1002 urbanecm and
urbanecm: Backport for [[gerrit:876051|Revert "GlobalRename: Convert DB selects to use SelectQueryBuilder" (T326377 T312394)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet
[01:01:45] https://meta.wikimedia.org/wiki/Special:GlobalRenameProgress no longer throws a DB error at the debug server, proceeding
[01:02:08] I literally don't understand why.
[01:02:54] zabe: me neither. i'd suggest a fix, but it looks like a mystery at this point. i don't want to keep it broken and annoy the renamers, so...reverting for now :)
[01:03:30] sure
[01:08:03] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:876051|Revert "GlobalRename: Convert DB selects to use SelectQueryBuilder" (T326377 T312394)]] (duration: 08m 48s)
[01:08:07] T326377: Special:GlobalRenameProgress fails with "Wikimedia\Rdbms\DBQueryError: Error 1146: Table 'metawiki.renameuser_status' doesn't exist" - https://phabricator.wikimedia.org/T326377
[01:08:07] T312394: Migrate usage of Database::select to SelectQueryBuilder in CentralAuth - https://phabricator.wikimedia.org/T312394
[01:13:54] * urbanecm leaves the fix in wmf.17-only (it doesn't pass CI) and records it as next week's train blocker
[01:15:34] If I had to guess, I would say it is a problem in rdbms, rather than a problem with centralauth.
[01:16:38] that's my guess too.
but I have no idea why copy pasting identical code into shell.php works, and calling the method doesn't :-/
[01:26:01] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1006-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[01:42:46] (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:57:46] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:07:46] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:17:46] (JobUnavailable) resolved: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:30:48] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:31:18] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver
https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[02:51:46] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:52:28] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:53:52] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:55:18] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 20 Feb 2023 05:31:14 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:58:04] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49419 bytes in 0.061 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:58:46] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.269 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:27:46] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:34:46] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin2002 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[04:32:54] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[04:34:28]
RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST
[05:22:14] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[05:23:50] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST
[05:26:01] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1006-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[06:47:30] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) is CRITICAL: Test Zotero and citoid alive returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid
[06:48:58] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy
https://wikitech.wikimedia.org/wiki/Citoid
[07:00:04] Deploy window MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230106T0700)
[07:58:26] (PS1) Ayounsi: Revert "Revert "drmrs offload Vodafone from Tata"" [homer/public] - https://gerrit.wikimedia.org/r/876052 (https://phabricator.wikimedia.org/T324955)
[08:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230106T0800)
[08:03:07] (CR) Ayounsi: [C: +2] Revert "Revert "drmrs offload Vodafone from Tata"" [homer/public] - https://gerrit.wikimedia.org/r/876052 (https://phabricator.wikimedia.org/T324955) (owner: Ayounsi)
[08:03:42] (Merged) jenkins-bot: Revert "Revert "drmrs offload Vodafone from Tata"" [homer/public] - https://gerrit.wikimedia.org/r/876052 (https://phabricator.wikimedia.org/T324955) (owner: Ayounsi)
[08:05:56] !log drmrs offload Vodafone from Tata - T324955
[08:05:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:20:57] PROBLEM - SSH on an-launcher1002 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:22:33] RECOVERY - SSH on an-launcher1002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:37:41] I'm going to do another (last for a bit) round of mass peering request emails to 37 interesting DE-CIX Marseille peers (not contacted yet), it's going to cause some noise in here, please ignore
[08:39:56] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 16347
[08:40:16] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 16347
[08:40:17] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 42473
[08:41:09] (CR) Hashar: [C: +1] phabricator: use systemd::sysuser to
create vcs user [puppet] - https://gerrit.wikimedia.org/r/865207 (owner: Dzahn)
[08:41:14] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 42473
[08:41:15] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 132602
[08:42:28] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 132602
[08:42:29] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 35432
[08:43:25] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 35432
[08:43:26] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 51254
[08:44:25] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 51254
[08:44:26] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 58715
[08:45:08] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 58715
[08:45:09] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 22822
[08:47:38] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 22822
[08:47:39] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 47794
[08:47:52] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 47794
[08:47:52] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 48237
[08:49:05] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 48237
[08:49:06] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 39405
[08:49:41] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering
(exit_code=0) with action 'email' for AS: 39405
[08:49:42] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 21320
[08:50:11] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 21320
[08:50:12] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 61573
[08:50:25] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 61573
[08:50:26] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 41095
[08:51:58] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 41095
[08:51:59] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 13113
[08:52:09] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 13113
[08:52:10] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 37558
[08:52:23] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 37558
[08:52:24] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 37282
[08:52:45] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 37282
[08:52:46] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 21245
[08:53:08] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 21245
[08:53:09] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 56630
[08:53:33] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 56630
[08:53:34] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 327700
[08:53:44]
!log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 327700
[08:53:45] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 62597
[08:54:46] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 62597
[08:54:47] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 201746
[08:55:13] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 201746
[08:55:13] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 51185
[08:55:38] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 51185
[08:55:39] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 263237
[08:55:58] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 263237
[08:55:59] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 64049
[08:57:12] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 64049
[08:57:13] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 9119
[08:57:25] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 9119
[08:57:26] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 24482
[08:59:27] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 24482
[08:59:28] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 45489
[09:00:26] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 45489
[09:00:27] !log ayounsi@cumin1001 START - Cookbook
sre.network.peering with action 'email' for AS: 58717
[09:00:41] PROBLEM - SSH on an-launcher1002 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:01:17] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 58717
[09:01:18] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 60427
[09:01:37] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 60427
[09:01:38] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 15954
[09:01:52] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 15954
[09:01:53] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 32035
[09:02:13] RECOVERY - SSH on an-launcher1002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:02:18] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 32035
[09:02:19] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 4788
[09:03:42] !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.network.peering (exit_code=99) with action 'email' for AS: 4788
[09:03:43] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 37473
[09:04:53] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 37473
[09:04:54] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 5713
[09:05:21] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 5713
[09:05:22] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 9038
[09:05:41] !log ayounsi@cumin1001 END (PASS) - Cookbook
sre.network.peering (exit_code=0) with action 'email' for AS: 9038
[09:05:42] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 266925
[09:06:24] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 266925
[09:06:25] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 36994
[09:06:59] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 36994
[09:08:19] PROBLEM - SSH on an-launcher1002 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:09:49] RECOVERY - SSH on an-launcher1002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:14:01] PROBLEM - Check systemd state on snapshot1008 is CRITICAL: CRITICAL - degraded: The following units failed: xlation-dumps.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:26:01] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1006-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[09:28:17] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[09:32:17] that alert is still broken ;)
[09:33:03] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL:
cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[09:34:37] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST
[09:42:37] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[09:44:13] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST
[09:50:21] (PS1) Jelto: gitlab_runner: lower docker_gc watermarks in wmcs [puppet] - https://gerrit.wikimedia.org/r/876184 (https://phabricator.wikimedia.org/T326378)
[10:01:47] (CR) Jelto: "It seems the docker::gc job is not doing any cleanup due to quite high watermarks.
I lowered the watermarks to start cleanup a little earl" [puppet] - https://gerrit.wikimedia.org/r/876184 (https://phabricator.wikimedia.org/T326378) (owner: Jelto)
[10:10:07] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 21245
[10:10:55] !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.network.peering (exit_code=99) with action 'configure' for AS: 21245
[10:32:00] SRE, SRE-Access-Requests: Requesting access to deployment for Zabe - https://phabricator.wikimedia.org/T326327 (Volans)
[10:32:45] SRE, SRE-Access-Requests: Requesting access to deployment for Zabe - https://phabricator.wikimedia.org/T326327 (Volans) p:Triage→Medium
[10:38:17] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[10:39:47] SRE, serviceops-collab: gerrit1003 service implementation task - https://phabricator.wikimedia.org/T326368 (LSobanski) Open→Stalled p:Triage→Medium
[10:39:49] SRE, ops-eqiad, DC-Ops, serviceops-collab: Q3:rack/setup/install gerrit1003 - https://phabricator.wikimedia.org/T326366 (LSobanski)
[10:39:50] (CR) Jbond: [C: -1] "lgtm other then the issue highlighted" [puppet] - https://gerrit.wikimedia.org/r/875897 (owner: Effie Mouzeli)
[10:43:01] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[10:44:33] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST
[10:45:15] SRE, SRE-Access-Requests: Requesting access to deployment for Zabe - https://phabricator.wikimedia.org/T326327 (Urbanecm) This has my +1, Zabe's deployment access would help him in his work in many areas. Thanks for volunteering!
[10:47:59] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 159 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[10:49:05] looking
[10:49:35] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 43 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[10:50:53] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[10:50:57] lots of Error: Cannot use object of type
ContentTranslation\DTO\TranslationUnitDTO as array
[10:52:29] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST
[10:59:19] one of the spike was `[{reqId}] {exception_url} Wikimedia\Rdbms\DBQueryTimeoutError: A database query timeout has occurred. Query: SET STATEMENT max_statement_time=30 FOR SELECT /*! STRAIGHT_JOIN */ actor_name,actor_user,rc_actor,rc_id,rc_timestamp,rc_namespace,rc_title,rc`
[10:59:40] some others were Parsoid timing out for some pages on enwiki
[10:59:54] they don't seem to be too problematic
[11:00:44] (CR) Jbond: phabricator: change phd home dir to /var/lib/phd (2 comments) [puppet] - https://gerrit.wikimedia.org/r/875266 (https://phabricator.wikimedia.org/T326146) (owner: Hashar)
[11:01:27] SRE-OnFire, Data-Engineering-Planning, serviceops, Patch-For-Review, Sustainability (Incident Followup): Incident: 2022-12-09 api appserver worker starvation - https://phabricator.wikimedia.org/T324994 (EChetty)
[11:02:15] SRE-OnFire, Data-Engineering-Planning, serviceops, Sustainability (Incident Followup): Uneven CPU throttling of eventgate-analytics under load - https://phabricator.wikimedia.org/T325068 (EChetty)
[11:02:50] (CR) Jbond: [C: +1] Rename alias [puppet] - https://gerrit.wikimedia.org/r/875971 (owner: Muehlenhoff)
[11:03:25] (CR) Jbond: [C: +1] "lgtm thanks" [puppet] - https://gerrit.wikimedia.org/r/875446 (owner: Dzahn)
[11:05:34] (CR) Jbond: "lgtm" [software/cumin] - https://gerrit.wikimedia.org/r/875985 (owner: Volans)
[11:06:20] (CR) Jbond: [C: +1] "lgtm" [software/cumin] - https://gerrit.wikimedia.org/r/875986 (owner: Volans)
[11:37:48] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db1130.eqiad.wmnet with reason: Maintenance
[11:38:01] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1130.eqiad.wmnet with reason: Maintenance
[11:38:10] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db2113.codfw.wmnet with reason: Maintenance
[11:38:23] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2113.codfw.wmnet with reason: Maintenance
[11:38:44] !log upload bgpalerter to bullseye-wikimedia
[11:38:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:50:24] (PS1) Jbond: bgpalerter: manage installing package [puppet] - https://gerrit.wikimedia.org/r/876191
[11:51:01] (CR) Jbond: [C: +2] bgpalerter: manage installing package [puppet] - https://gerrit.wikimedia.org/r/876191 (owner: Jbond)
[11:51:38] (CR) CI reject: [V: -1] bgpalerter: manage installing package [puppet] - https://gerrit.wikimedia.org/r/876191 (owner: Jbond)
[11:52:53] (PS2) Jbond: bgpalerter: manage installing package [puppet] - https://gerrit.wikimedia.org/r/876191
[11:57:52] (CR) Jbond: [C: +2] bgpalerter: manage installing package [puppet] - https://gerrit.wikimedia.org/r/876191 (owner: Jbond)
[12:04:47] SRE, Infrastructure-Foundations, Kubernetes, Security: Network segmentation for WMF servers - https://phabricator.wikimedia.org/T101912 (LSobanski)
[12:08:16] SRE, MediaWiki-Shell, WMF-General-or-Unknown, Security, Sustainability (Incident Followup): Securing external binaries run by MediaWiki - https://phabricator.wikimedia.org/T172584 (LSobanski)
[12:14:57] SRE, Security: Network isolation for production and semi-production services - https://phabricator.wikimedia.org/T121240 (LSobanski)
[12:15:18] SRE, Infrastructure-Foundations, Kubernetes, Security: Network segmentation
for WMF servers - https://phabricator.wikimedia.org/T101912 (10LSobanski) [12:19:46] (03PS2) 10Stevemunene: Bump up mediawiki_history_snapshot to 2022-12 [puppet] - 10https://gerrit.wikimedia.org/r/875364 (owner: 10Mforns) [12:20:25] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [12:21:00] (03CR) 10Stevemunene: [C: 03+2] Bump up mediawiki_history_snapshot to 2022-12 [puppet] - 10https://gerrit.wikimedia.org/r/875364 (owner: 10Mforns) [12:23:37] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [12:28:33] (03PS1) 10Jbond: bgpalerter: update binary path [puppet] - 10https://gerrit.wikimedia.org/r/876192 [12:29:04] !log stevemunene@cumin1001 START - Cookbook sre.aqs.roll-restart for AQS aqs cluster: Roll restart of all AQS's nodejs daemons. 
[12:29:30] (03CR) 10Jbond: [C: 03+2] bgpalerter: update binary path [puppet] - 10https://gerrit.wikimedia.org/r/876192 (owner: 10Jbond) [12:35:53] 10SRE, 10Infrastructure-Foundations, 10vm-requests: EQIAD: 1 VM request for idm-test - https://phabricator.wikimedia.org/T326406 (10SLyngshede-WMF) [12:36:03] 10SRE, 10Infrastructure-Foundations, 10vm-requests: EQIAD: 1 VM request for idm-test - https://phabricator.wikimedia.org/T326406 (10SLyngshede-WMF) a:03SLyngshede-WMF [12:36:14] !log running extensions/SecurePoll/cli/wm-scripts/ucoc2023/ucoc2023_tables.sql on each wiki [12:36:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:19] 10SRE, 10Infrastructure-Foundations, 10vm-requests: EQIAD: 1 VM request for idm-test - https://phabricator.wikimedia.org/T326406 (10SLyngshede-WMF) p:05Triage→03Low [12:42:47] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.aqs.roll-restart (exit_code=0) for AQS aqs cluster: Roll restart of all AQS's nodejs daemons. [12:49:30] 10SRE-OnFire, 10Data-Engineering-Planning, 10Event-Platform Value Stream, 10serviceops, 10Sustainability (Incident Followup): Uneven CPU throttling of eventgate-analytics under load - https://phabricator.wikimedia.org/T325068 (10EChetty) [12:50:14] 10SRE-OnFire, 10Data-Engineering-Planning, 10Event-Platform Value Stream, 10serviceops, and 2 others: Incident: 2022-12-09 api appserver worker starvation - https://phabricator.wikimedia.org/T324994 (10EChetty) [12:59:40] 10SRE: docker-registry.wikimedia.org/golang:1.11 should no more depends on stretch-backports - https://phabricator.wikimedia.org/T261920 (10LSobanski) 05Open→03Resolved a:03LSobanski Based on P11925#75528 and the fact that there is a Buster version of golang 1.11 (1.11.6-1+deb10u4), I think this can be res... 
[12:59:47] 10SRE, 10Patch-For-Review: Handle sunset of stretch-backports - https://phabricator.wikimedia.org/T256877 (10LSobanski) [13:02:25] 10SRE, 10Infrastructure-Foundations, 10User-MoritzMuehlenhoff: Automated removal of obsolete kernels - https://phabricator.wikimedia.org/T277011 (10LSobanski) [13:04:33] 10SRE, 10Release Pipeline, 10serviceops, 10Epic, 10Release-Engineering-Team (Seen): Migrate production services to kubernetes using the pipeline - https://phabricator.wikimedia.org/T198901 (10LSobanski) [13:08:51] 10SRE, 10Infrastructure-Foundations, 10User-MoritzMuehlenhoff: Migrate remaining services using Java to profile::java - https://phabricator.wikimedia.org/T264174 (10LSobanski) [13:17:39] 10SRE, 10Commons, 10MediaWiki-File-management, 10StructuredDataOnCommons, and 2 others: Frequent "Error: 429, Too Many Requests" errors on pages with many (>50) thumbnails - https://phabricator.wikimedia.org/T266155 (10LSobanski) [13:17:58] 10SRE, 10TimedMediaHandler-Transcode: Increase job runners on video scalers to maximize load efficiency - https://phabricator.wikimedia.org/T201358 (10LSobanski) 05Open→03Resolved a:03LSobanski [13:19:19] (03PS4) 10Jelto: gitlab: stop using "latest" backup name [puppet] - 10https://gerrit.wikimedia.org/r/875309 (https://phabricator.wikimedia.org/T274463) [13:20:22] 10SRE, 10DBA, 10MediaWiki-libs-Rdbms, 10Patch-For-Review, 10Performance-Team (Radar): Check if setBigSelects() is still needed - https://phabricator.wikimedia.org/T325610 (10LSobanski) [13:26:01] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1006-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [13:33:29] (03CR) 10David Caro: [C: 03+1] "Got a question there, but looks ok" [puppet] - 10https://gerrit.wikimedia.org/r/874813 
(https://phabricator.wikimedia.org/T317478) (owner: 10Majavah) [13:34:57] (03CR) 10Majavah: openstack: encapi: open up write access (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/874813 (https://phabricator.wikimedia.org/T317478) (owner: 10Majavah) [13:51:59] (03PS5) 10Jelto: gitlab: stop using "latest" backup name [puppet] - 10https://gerrit.wikimedia.org/r/875309 (https://phabricator.wikimedia.org/T274463) [13:53:42] (03PS1) 10Stang: zhwiki: Install PageAssessments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/876196 [13:53:57] (03PS1) 10Reedy: wm-scripts: Get Flow DB_REPLICA in a different way [extensions/SecurePoll] (wmf/1.40.0-wmf.17) - 10https://gerrit.wikimedia.org/r/876056 (https://phabricator.wikimedia.org/T326408) [13:54:05] (03CR) 10Reedy: [C: 03+2] wm-scripts: Get Flow DB_REPLICA in a different way [extensions/SecurePoll] (wmf/1.40.0-wmf.17) - 10https://gerrit.wikimedia.org/r/876056 (https://phabricator.wikimedia.org/T326408) (owner: 10Reedy) [13:54:11] (03PS2) 10Stang: zhwiki: Install PageAssessments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/876196 (https://phabricator.wikimedia.org/T326387) [13:55:45] (03CR) 10Reedy: [C: 03+2] wm-scripts: Get Flow DB_REPLICA in a different way [extensions/SecurePoll] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/876057 (https://phabricator.wikimedia.org/T326408) (owner: 10Reedy) [13:56:51] (03Merged) 10jenkins-bot: wm-scripts: Get Flow DB_REPLICA in a different way [extensions/SecurePoll] (wmf/1.40.0-wmf.17) - 10https://gerrit.wikimedia.org/r/876056 (https://phabricator.wikimedia.org/T326408) (owner: 10Reedy) [13:57:26] (03CR) 10CI reject: [V: 04-1] wm-scripts: Get Flow DB_REPLICA in a different way [extensions/SecurePoll] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/876057 (https://phabricator.wikimedia.org/T326408) (owner: 10Reedy) [13:57:51] really phan [13:58:12] (03Abandoned) 10Reedy: wm-scripts: Get Flow DB_REPLICA in a different way [extensions/SecurePoll] 
(wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/876057 (https://phabricator.wikimedia.org/T326408) (owner: 10Reedy) [14:01:41] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.130 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:05:15] PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.131 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:06:22] !log reedy@deploy1002 Synchronized php-1.40.0-wmf.17/extensions/SecurePoll/cli/wm-scripts/ucoc2023/populateEditCount.php: T326408 (duration: 07m 09s) [14:06:25] 10SRE, 10Performance-Team (Radar): unwind the Puppetized /etc/hosts override of statsd.eqiad.wmnet - https://phabricator.wikimedia.org/T239862 (10LSobanski) @Joe's patch mentioned above has been merged in Feb 2021 and the hardcoded IP config has since been moved to monitoring.pp. @Cdanis can the entry be remov... [14:06:25] T326408: Flow edit count isn't getting Flow database correctly - https://phabricator.wikimedia.org/T326408 [14:10:47] RECOVERY - Host parse1002 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [14:15:04] (03PS28) 10Ottomata: flink-kubernetes-operator - modify for WMF [deployment-charts] - 10https://gerrit.wikimedia.org/r/865158 (https://phabricator.wikimedia.org/T324576) [14:15:06] (03PS1) 10Ottomata: flink-operator - add admin_ng helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/876200 (https://phabricator.wikimedia.org/T324576) [14:15:34] (03CR) 10Ottomata: "Already reviewed in I74ae11d8604be5bb5ce9cdb41c5e51aae38f4723" [deployment-charts] - 10https://gerrit.wikimedia.org/r/876200 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) [14:16:18] (03CR) 10Ottomata: [C: 03+2] flink-kubernetes-operator - Initial commit of upstream helm chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/865100 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) 
[14:20:53] (03CR) 10Hashar: phabricator: change phd home dir to /var/lib/phd (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/875266 (https://phabricator.wikimedia.org/T326146) (owner: 10Hashar) [14:21:44] (03Merged) 10jenkins-bot: flink-kubernetes-operator - Initial commit of upstream helm chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/865100 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) [14:23:45] (03PS29) 10Ottomata: flink-kubernetes-operator - modify for WMF [deployment-charts] - 10https://gerrit.wikimedia.org/r/865158 (https://phabricator.wikimedia.org/T324576) [14:23:56] (03CR) 10Ottomata: [C: 03+2] flink-kubernetes-operator - modify for WMF [deployment-charts] - 10https://gerrit.wikimedia.org/r/865158 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) [14:24:04] (03PS23) 10Ottomata: flink-app chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/866510 (https://phabricator.wikimedia.org/T324576) [14:24:14] (03PS2) 10Ottomata: flink-operator - add admin_ng helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/876200 (https://phabricator.wikimedia.org/T324576) [14:24:57] (03CR) 10CI reject: [V: 04-1] flink-app chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/866510 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) [14:28:31] (03Merged) 10jenkins-bot: flink-kubernetes-operator - modify for WMF [deployment-charts] - 10https://gerrit.wikimedia.org/r/865158 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) [14:28:49] (03CR) 10CI reject: [V: 04-1] flink-operator - add admin_ng helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/876200 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) [14:30:09] (03PS1) 10Ottomata: Bump flink-kubernetes-operator chart versions to 1.3.0 to match image and upstream version [deployment-charts] - 10https://gerrit.wikimedia.org/r/876203 (https://phabricator.wikimedia.org/T324576) [14:30:52] (03PS2) 
10Ottomata: Bump flink-kubernetes-operator chart versions to 1.3.0 to match image and upstream version [deployment-charts] - 10https://gerrit.wikimedia.org/r/876203 (https://phabricator.wikimedia.org/T324576) [14:31:43] (03CR) 10Ottomata: [C: 03+2] Bump flink-kubernetes-operator chart versions to 1.3.0 to match image and upstream version [deployment-charts] - 10https://gerrit.wikimedia.org/r/876203 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) [14:36:59] (03Merged) 10jenkins-bot: Bump flink-kubernetes-operator chart versions to 1.3.0 to match image and upstream version [deployment-charts] - 10https://gerrit.wikimedia.org/r/876203 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) [14:38:49] 10ops-eqiad, 10DC-Ops, 10serviceops: hw troubleshooting: CPU1 machine check error on parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T326119 (10Jclark-ctr) Sorry did not give update. Case# 159648923 was submitted 1/4/2023 Idrac was not reachable remotely. Reset Idrac with crash cart 1/6/2023 TSR... [14:38:59] (03PS13) 10Hashar: httpd: add flag to wait for network-online.target [puppet] - 10https://gerrit.wikimedia.org/r/875314 (https://phabricator.wikimedia.org/T326125) [14:39:08] (03PS5) 10Hashar: gerrit: make Apache wait for network-online.target [puppet] - 10https://gerrit.wikimedia.org/r/875315 (https://phabricator.wikimedia.org/T326125) [14:39:35] 10SRE, 10Traffic, 10Upstream: Review cp2041 and cp2042 running bullseye - https://phabricator.wikimedia.org/T325557 (10ssingh) Update from the maintainer: the package is no longer being maintained in Debian so we will build our own. https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1027994#10 > On Fri, Jan... 
[14:42:01] (03CR) 10JMeybohm: flink and flink-kubernetes-operator image (0313 comments) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/858356 (https://phabricator.wikimedia.org/T316519) (owner: 10Ottomata) [14:42:26] !log remove bgpalerter from apt [14:42:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:44] (03PS24) 10JMeybohm: flink-app chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/866510 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) [14:43:34] (03CR) 10CI reject: [V: 04-1] flink-app chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/866510 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) [15:07:13] !log depool cp5032 for bullseye upgrade (starting with NIC firmware upgrade): T325797 [15:07:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:17] T325797: oom killed varnish on cp4052 - https://phabricator.wikimedia.org/T325797 [15:07:45] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp5032.eqsin.wmnet,service=cdn [15:07:45] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp5032.eqsin.wmnet,service=ats-be [15:08:11] !log sukhe@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp5032.eqsin.wmnet [15:08:43] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=False) upgrade firmware for hosts cp5032.eqsin.wmnet [15:10:15] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp5032.eqsin.wmnet with OS bullseye [15:10:21] 10SRE, 10Traffic: oom killed varnish on cp4052 - https://phabricator.wikimedia.org/T325797 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp5032.eqsin.wmnet with OS bullseye [15:16:23] (03PS1) 10Ssingh: hiera: cp5032: do not set use_linux510_on_buster [puppet] - 10https://gerrit.wikimedia.org/r/876206 
(https://phabricator.wikimedia.org/T325797) [15:17:41] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38989/console" [puppet] - 10https://gerrit.wikimedia.org/r/876206 (https://phabricator.wikimedia.org/T325797) (owner: 10Ssingh) [15:18:28] (03CR) 10Ssingh: [V: 03+1 C: 03+2] hiera: cp5032: do not set use_linux510_on_buster [puppet] - 10https://gerrit.wikimedia.org/r/876206 (https://phabricator.wikimedia.org/T325797) (owner: 10Ssingh) [15:24:27] (03PS1) 10Ssingh: install_server: remove installation of linux-image-5.10-amd64 for cp[45]* [puppet] - 10https://gerrit.wikimedia.org/r/876207 [15:24:51] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38990/console" [puppet] - 10https://gerrit.wikimedia.org/r/876207 (owner: 10Ssingh) [15:26:48] (03CR) 10JMeybohm: [C: 03+1] wmfdebug 0.0.6: Include the wmf-certificates package (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/875439 (owner: 10Ahmon Dancy) [15:27:49] (03PS2) 10Ssingh: install_server: remove installation of linux-image-5.10-amd64 for cp[45]* [puppet] - 10https://gerrit.wikimedia.org/r/876207 [15:28:48] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38991/console" [puppet] - 10https://gerrit.wikimedia.org/r/876207 (owner: 10Ssingh) [15:30:43] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp5032.eqsin.wmnet with OS bullseye [15:30:47] 10SRE, 10Traffic: oom killed varnish on cp4052 - https://phabricator.wikimedia.org/T325797 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp5032.eqsin.wmnet with OS bullseye executed with errors: - cp5032 (**FAIL**) - Downtimed on Icinga/Alertmanager - Disab... 
[15:30:51] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:31:13] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp5032.eqsin.wmnet with OS bullseye [15:31:18] 10SRE, 10Traffic: oom killed varnish on cp4052 - https://phabricator.wikimedia.org/T325797 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp5032.eqsin.wmnet with OS bullseye [15:32:21] PROBLEM - Check unit status of httpbb_kubernetes_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:34:39] (03PS20) 10Jbond: O:rpkivalidator: add bgpalerter to rpki servers [puppet] - 10https://gerrit.wikimedia.org/r/753450 (https://phabricator.wikimedia.org/T230600) [15:35:31] (03PS21) 10Jbond: O:rpkivalidator: add bgpalerter to rpki servers [puppet] - 10https://gerrit.wikimedia.org/r/753450 (https://phabricator.wikimedia.org/T230600) [15:36:40] (03PS22) 10Jbond: O:rpkivalidator: add bgpalerter to rpki servers [puppet] - 10https://gerrit.wikimedia.org/r/753450 (https://phabricator.wikimedia.org/T230600) [15:37:32] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38994/console" [puppet] - 10https://gerrit.wikimedia.org/r/753450 (https://phabricator.wikimedia.org/T230600) (owner: 10Jbond) [15:39:00] (03CR) 10CI reject: [V: 04-1] O:rpkivalidator: add bgpalerter to rpki servers [puppet] - 10https://gerrit.wikimedia.org/r/753450 (https://phabricator.wikimedia.org/T230600) (owner: 10Jbond) [15:40:02] (03PS23) 10Jbond: O:rpkivalidator: add bgpalerter to rpki servers [puppet] - 10https://gerrit.wikimedia.org/r/753450 (https://phabricator.wikimedia.org/T230600) [15:40:04] (03CR) 10Jbond: "rebased" [puppet] - 
10https://gerrit.wikimedia.org/r/753450 (https://phabricator.wikimedia.org/T230600) (owner: 10Jbond) [15:43:29] (03PS1) 10Hashar: wm-checks-api: fix TypeScript noImplicitAny [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/876212 [15:43:41] PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.131 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:45:25] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:52:59] RECOVERY - Check unit status of httpbb_kubernetes_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:54:02] !log cgoubert@cumin1001 conftool action : set/pooled=inactive; selector: name=mw1486.eqiad.wmnet [15:58:06] 10ops-eqiad, 10DC-Ops, 10serviceops: hw troubleshooting: power consumption reboot failure for mw1486.eqiad.wmnet - https://phabricator.wikimedia.org/T326425 (10Clement_Goubert) p:05Triage→03Low [15:58:45] (03PS2) 10Ahmon Dancy: wmfdebug 0.0.6: Include the wmf-certificates package [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/875439 [15:58:57] (03CR) 10Xcollazo: "(closing draft comments)" [puppet] - 10https://gerrit.wikimedia.org/r/824241 (https://phabricator.wikimedia.org/T312858) (owner: 10Xcollazo) [15:59:02] (03CR) 10Ahmon Dancy: wmfdebug 0.0.6: Include the wmf-certificates package (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/875439 (owner: 10Ahmon Dancy) [16:00:54] 10ops-eqiad, 10DC-Ops, 10serviceops: hw troubleshooting: power consumption reboot failure for mw1486.eqiad.wmnet - https://phabricator.wikimedia.org/T326425 (10Clement_Goubert) racadm getsel log: ` ------------------------------------------------------------------------------- Record: 5 Date/Time: 01/... 
[16:01:00] 10ops-eqiad, 10DC-Ops, 10serviceops: hw troubleshooting: power consumption reboot failure for mw1486.eqiad.wmnet - https://phabricator.wikimedia.org/T326425 (10Clement_Goubert) [16:02:11] (03CR) 10JMeybohm: flink-operator - add admin_ng helmfile (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/876200 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) [16:02:54] (03PS3) 10JMeybohm: flink-operator - add admin_ng helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/876200 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) [16:03:09] (03PS1) 10Bking: [WIP] wdqs-data-reload: use NFS for data reloads [cookbooks] - 10https://gerrit.wikimedia.org/r/876217 (https://phabricator.wikimedia.org/T323096) [16:03:29] (03CR) 10JMeybohm: flink-operator - add admin_ng helmfile (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/876200 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) [16:04:20] 10ops-eqiad, 10DC-Ops, 10serviceops: hw troubleshooting: CPU1 machine check error on parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T326119 (10Clement_Goubert) Thanks for the update. I will extend the downtime to two weeks from now, will revisit if necessary. [16:05:44] !log cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 14 days, 0:00:00 on parse1002.eqiad.wmnet with reason: CPU1 machine check error [16:05:47] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on parse1002.eqiad.wmnet with reason: CPU1 machine check error [16:05:50] 10ops-eqiad, 10DC-Ops, 10serviceops: hw troubleshooting: CPU1 machine check error on parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T326119 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=5c4b686a-9560-44c1-acb3-c16978d72b37) set by cgoubert@cumin1001 for 14 days, 0:00:00 on 1... 
[16:06:23] (03CR) 10Xcollazo: [C: 03+1] "Now that we have conda analytics deployed, and there is a user deadline for moving from anaconda-wmf, do we still need this patch or can w" [debs/anaconda-wmf] (debian) - 10https://gerrit.wikimedia.org/r/780898 (https://phabricator.wikimedia.org/T306197) (owner: 10Ottomata) [16:10:34] (03CR) 10Ottomata: flink and flink-kubernetes-operator image (0314 comments) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/858356 (https://phabricator.wikimedia.org/T316519) (owner: 10Ottomata) [16:12:00] (03PS15) 10Ottomata: flink and flink-kubernetes-operator image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/858356 (https://phabricator.wikimedia.org/T316519) [16:14:56] (03CR) 10JMeybohm: [C: 03+1] "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/876200 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) [16:15:22] (03CR) 10JMeybohm: [C: 03+1] wmfdebug 0.0.6: Include the wmf-certificates package [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/875439 (owner: 10Ahmon Dancy) [16:15:46] (03CR) 10Ssingh: [V: 03+1] "Execution of preseeded command "wget -O /tmp/late_command │" [puppet] - 10https://gerrit.wikimedia.org/r/876207 (owner: 10Ssingh) [16:16:49] (03CR) 10Ssingh: [V: 03+1 C: 03+2] install_server: remove installation of linux-image-5.10-amd64 for cp[45]* [puppet] - 10https://gerrit.wikimedia.org/r/876207 (owner: 10Ssingh) [16:17:28] (03CR) 10Ottomata: "Interesting, because I don't include all of the expected scaffold templates, some o the test case fixtures are failing." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/866510 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) [16:18:01] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp5032.eqsin.wmnet with OS bullseye [16:18:06] 10SRE, 10Traffic: oom killed varnish on cp4052 - https://phabricator.wikimedia.org/T325797 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp5032.eqsin.wmnet with OS bullseye executed with errors: - cp5032 (**FAIL**) - Removed from Puppet and PuppetDB if presen... [16:21:02] (03PS4) 10Ottomata: flink-operator - add admin_ng helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/876200 (https://phabricator.wikimedia.org/T324576) [16:21:16] (03CR) 10CI reject: [V: 04-1] flink-operator - add admin_ng helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/876200 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) [16:26:02] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp5032.eqsin.wmnet with OS bullseye [16:26:08] 10SRE, 10Traffic: oom killed varnish on cp4052 - https://phabricator.wikimedia.org/T325797 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp5032.eqsin.wmnet with OS bullseye [16:29:55] (03PS5) 10Ottomata: flink-operator - add admin_ng helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/876200 (https://phabricator.wikimedia.org/T324576) [16:32:42] (03CR) 10Dzahn: [C: 03+2] admin: add data types to validate UIDs [puppet] - 10https://gerrit.wikimedia.org/r/875446 (owner: 10Dzahn) [16:33:00] (03PS25) 10Ottomata: flink-app chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/866510 (https://phabricator.wikimedia.org/T324576) [16:33:56] (03CR) 10CI reject: [V: 04-1] flink-app chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/866510 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) [16:34:35] (03CR) 10Dzahn: 
[C: 03+1] "nit: please add what the actual problem was that was fixed, but lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/869717 (owner: 10AOkoth) [16:36:39] (03PS1) 10MVernon: thanos: drain thanos-be[1,2]004 [puppet] - 10https://gerrit.wikimedia.org/r/876221 (https://phabricator.wikimedia.org/T279621) [16:38:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:38:16] (03CR) 10MVernon: "Hi," [puppet] - 10https://gerrit.wikimedia.org/r/876221 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [16:43:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:43:51] (03CR) 10JMeybohm: flink and flink-kubernetes-operator image (036 comments) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/858356 (https://phabricator.wikimedia.org/T316519) (owner: 10Ottomata) [16:48:13] PROBLEM - mediawiki-installation DSH group on mw1486 is CRITICAL: Host mw1486 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:53:25] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp5032.eqsin.wmnet with OS bullseye [16:53:29] 10SRE, 10Traffic: oom killed varnish on cp4052 - https://phabricator.wikimedia.org/T325797 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp5032.eqsin.wmnet with OS bullseye executed with errors: - cp5032 (**FAIL**) - Removed from Puppet and 
PuppetDB if presen... [16:53:41] (03PS26) 10Ottomata: flink-app chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/866510 (https://phabricator.wikimedia.org/T324576) [16:53:53] (03CR) 10CI reject: [V: 04-1] flink-app chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/866510 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) [16:53:54] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp5032.eqsin.wmnet with OS bullseye [16:58:24] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: hw troubleshooting: power consumption reboot failure for mw1486.eqiad.wmnet - https://phabricator.wikimedia.org/T326425 (10Jclark-ctr) Thank you for deploying will investigate today while on site [17:06:09] PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:15:48] (03Abandoned) 10Ottomata: Actually set REQUESTS_CA_BUNDLE [debs/anaconda-wmf] (debian) - 10https://gerrit.wikimedia.org/r/780898 (https://phabricator.wikimedia.org/T306197) (owner: 10Ottomata) [17:19:33] (03CR) 10Dzahn: [C: 03+1] gitlab_runner: lower docker_gc watermarks in wmcs [puppet] - 10https://gerrit.wikimedia.org/r/876184 (https://phabricator.wikimedia.org/T326378) (owner: 10Jelto) [17:25:25] (03PS1) 10Vlad.shapik: WIP: Update Thumbor repository according to the latest changes [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/876229 (https://phabricator.wikimedia.org/T325811) [17:26:01] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1006-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [17:26:20] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on 
cp5032.eqsin.wmnet with reason: host reimage [17:29:29] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5032.eqsin.wmnet with reason: host reimage [17:29:54] (03CR) 10Dzahn: [C: 04-1] "Systemd::Sysuser[vcs]: has no parameter named 'gid' - https://puppet-compiler.wmflabs.org/output/865207/38996/phab1004.eqiad.wmnet/change." [puppet] - 10https://gerrit.wikimedia.org/r/865207 (owner: 10Dzahn) [17:42:02] (03PS4) 10Dzahn: phabricator: use systemd::sysuser to create vcs user [puppet] - 10https://gerrit.wikimedia.org/r/865207 [17:48:47] (03PS1) 10Btullis: Detect the correct disks for the O/S on the cephosd servers [puppet] - 10https://gerrit.wikimedia.org/r/876237 (https://phabricator.wikimedia.org/T324670) [17:54:10] (03CR) 10Ahmon Dancy: "There's more to be done to handle the runner-*-concurrent-2-cache-* volumes. I'll work on a separate commit for that." [puppet] - 10https://gerrit.wikimedia.org/r/876184 (https://phabricator.wikimedia.org/T326378) (owner: 10Jelto) [17:58:07] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp5032.eqsin.wmnet with OS bullseye [18:00:11] (03PS7) 10Raymond Ndibe: tools-webservice: read buildservice_repository from webservice.yaml config file [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/867910 (https://phabricator.wikimedia.org/T323689) [18:01:02] (03CR) 10Raymond Ndibe: tools-webservice: read buildservice_repository from webservice.yaml config file (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/867910 (https://phabricator.wikimedia.org/T323689) (owner: 10Raymond Ndibe) [18:01:20] (03CR) 10Ahmon Dancy: [C: 03+1] gitlab_runner: lower docker_gc watermarks in wmcs [puppet] - 10https://gerrit.wikimedia.org/r/876184 (https://phabricator.wikimedia.org/T326378) (owner: 10Jelto) [18:05:42] (03CR) 10Dzahn: [C: 03+2] gitlab_runner: lower docker_gc watermarks in wmcs [puppet] - 10https://gerrit.wikimedia.org/r/876184 
(https://phabricator.wikimedia.org/T326378) (owner: 10Jelto) [18:06:36] (03CR) 10Dzahn: [C: 03+2] gitlab_runner: lower docker_gc watermarks in wmcs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/876184 (https://phabricator.wikimedia.org/T326378) (owner: 10Jelto) [18:06:58] (03PS1) 10Ahmon Dancy: Make gitlab-runner cache volumes eligible for docker-gc [puppet] - 10https://gerrit.wikimedia.org/r/876240 (https://phabricator.wikimedia.org/T326378) [18:13:25] !log krinkle@cloudweb1003$ Run `UPDATE actor SET actor_user=31136 WHERE actor_id=14640;` to partially fix T326431 [18:13:25] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: hw troubleshooting: power consumption reboot failure for mw1486.eqiad.wmnet - https://phabricator.wikimedia.org/T326425 (10Jclark-ctr) Created ticket Confirmed: Service Request 159722060 was successfully submitted. Submitted TSR report to Dell [18:13:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:29] T326431: Some system users have invalid 'actor' database rows - https://phabricator.wikimedia.org/T326431 [18:16:04] PROBLEM - Host mw1486 is DOWN: PING CRITICAL - Packet loss = 100% [18:17:53] ^ seems to be expected from SAL? 
[18:18:28] (03CR) 10Dzahn: [C: 03+2] "deployed on runner-1026.the docker commandline has been changed there accordingly" [puppet] - 10https://gerrit.wikimedia.org/r/876184 (https://phabricator.wikimedia.org/T326378) (owner: 10Jelto) [18:18:43] T326425, I guess [18:18:43] T326425: hw troubleshooting: power consumption reboot failure for mw1486.eqiad.wmnet - https://phabricator.wikimedia.org/T326425 [18:19:31] sukhe: maybe worth a downtime if it’s out of service [18:19:36] yeah [18:19:46] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: hw troubleshooting: power consumption reboot failure for mw1486.eqiad.wmnet - https://phabricator.wikimedia.org/T326425 (10Dzahn) 18:16 <+icinga-wm> PROBLEM - Host mw1486 is DOWN: PING CRITICAL - Packet loss = 100% [18:20:35] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 4 days, 0:00:00 on mw1486.eqiad.wmnet with reason: downtimed, hw failure: T326425 [18:20:50] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on mw1486.eqiad.wmnet with reason: downtimed, hw failure: T326425 [18:21:28] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:21:38] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: hw troubleshooting: power consumption reboot failure for mw1486.eqiad.wmnet - https://phabricator.wikimedia.org/T326425 (10Dzahn) 05Open→03In progress [18:21:41] Thanks sukhe [18:21:48] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:22:51] thanks all [18:23:38] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:24:58] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 9.557 second response time 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:25:04] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 20 Feb 2023 05:31:14 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:26:04] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49419 bytes in 0.056 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:33:40] (03CR) 10Dzahn: "I was looking for the upstream man page / docs for the --volume-filter parameter but somehow it wasn't obvious where that is. not in my lo" [puppet] - 10https://gerrit.wikimedia.org/r/876240 (https://phabricator.wikimedia.org/T326378) (owner: 10Ahmon Dancy) [18:34:07] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp5032.eqsin.wmnet,service=cdn [18:34:08] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp5032.eqsin.wmnet,service=ats-be [18:35:49] (03CR) 10Dzahn: "https://docs.docker.com/search/?q=volume-filter" [puppet] - 10https://gerrit.wikimedia.org/r/876240 (https://phabricator.wikimedia.org/T326378) (owner: 10Ahmon Dancy) [18:36:12] !log pool cp5032 [bullseye upgrade completed]: T325797 [18:36:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:36:15] T325797: oom killed varnish on cp4052 - https://phabricator.wikimedia.org/T325797 [18:36:42] (03CR) 10Ahmon Dancy: Make gitlab-runner cache volumes eligible for docker-gc (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/876240 (https://phabricator.wikimedia.org/T326378) (owner: 10Ahmon Dancy) [18:39:53] (03CR) 10Dzahn: [C: 03+1] "thanks! 
lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/876240 (https://phabricator.wikimedia.org/T326378) (owner: 10Ahmon Dancy) [18:41:39] (03CR) 10Dzahn: [C: 03+2] Make gitlab-runner cache volumes eligible for docker-gc [puppet] - 10https://gerrit.wikimedia.org/r/876240 (https://phabricator.wikimedia.org/T326378) (owner: 10Ahmon Dancy) [18:42:20] (03CR) 10Dzahn: [C: 03+2] "confirmed docker::gc is only used in gitlab::runner profile" [puppet] - 10https://gerrit.wikimedia.org/r/876240 (https://phabricator.wikimedia.org/T326378) (owner: 10Ahmon Dancy) [18:47:58] PROBLEM - Check systemd state on gitlab-runner1002 is CRITICAL: CRITICAL - degraded: The following units failed: docker-gc.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:48:00] PROBLEM - Check systemd state on gitlab-runner2004 is CRITICAL: CRITICAL - degraded: The following units failed: docker-gc.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:48:04] (03CR) 10Dzahn: [C: 03+2] "deployed. --volume-filter 'label:com.gitlab.gitlab-runner.type=cache' has been added to the docker commandline on gitlab-runner* hosts" [puppet] - 10https://gerrit.wikimedia.org/r/876240 (https://phabricator.wikimedia.org/T326378) (owner: 10Ahmon Dancy) [18:48:04] PROBLEM - Check systemd state on gitlab-runner1004 is CRITICAL: CRITICAL - degraded: The following units failed: docker-gc.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:48:12] PROBLEM - Check systemd state on gitlab-runner2003 is CRITICAL: CRITICAL - degraded: The following units failed: docker-gc.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:48:19] well, that would be me then [18:48:21] (03CR) 10Ahmon Dancy: "Thanks Daniel!" [puppet] - 10https://gerrit.wikimedia.org/r/876240 (https://phabricator.wikimedia.org/T326378) (owner: 10Ahmon Dancy) [18:48:36] (03CR) 10Dzahn: [C: 03+2] "well.. 
PROBLEM - Check systemd state on gitlab-runner2003 is CRITICAL: CRITICAL - degraded: The following units failed: docker-gc.service" [puppet] - 10https://gerrit.wikimedia.org/r/876240 (https://phabricator.wikimedia.org/T326378) (owner: 10Ahmon Dancy) [18:49:03] I'm around for debugging [18:49:09] dancy: Jan 06 18:48:34 gitlab-runner1002 docker[1218708]: Invalid filter: 'label:com.gitlab.gitlab-runner.type> [18:49:10] PROBLEM - Check systemd state on gitlab-runner1003 is CRITICAL: CRITICAL - degraded: The following units failed: docker-gc.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:49:12] PROBLEM - Check systemd state on gitlab-runner2002 is CRITICAL: CRITICAL - degraded: The following units failed: docker-gc.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:49:37] !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 0:30:00 on 6 hosts with reason: debugging [18:49:41] hmm.. I'll look into that. Feel free to revert in the meantime. [18:49:46] RECOVERY - Check systemd state on gitlab-runner2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:49:49] Jan 06 18:44:40 gitlab-runner2002 docker[1216220]: Invalid filter: 'label:com.gitlab.gitlab-runner.type=cache' [18:49:51] dancy: 30 min downtime :) [18:49:59] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 6 hosts with reason: debugging [18:50:16] sukhe: thanks! yea, we just added that filter thing to the commandline [18:50:21] something about the format [18:50:39] mutante: let us know if you need an extra pair of eyes (on-call person :) [18:51:18] ok. needs to be ==, not = [18:51:28] fixing... [18:51:34] sukhe: thank you! I don't see any production services being affected. it's just that garbage collection won't run. and it's downtimed and will be fixed very soon :) [18:51:42] ok! 
[18:52:10] mutante: had a clean week so just being careful :P [18:52:35] sukhe: appreciate the reaction to icinga alerts :)) [18:52:58] (03PS1) 10Ahmon Dancy: Followup 3221b39a5736dd99befd5b72618c7c854dfb5252 [puppet] - 10https://gerrit.wikimedia.org/r/876247 [18:53:07] also it was nice that downtime cookbook takes wildcard in hostname [18:54:17] (03PS2) 10Dzahn: docker::gc: fix syntax for volume-filter [puppet] - 10https://gerrit.wikimedia.org/r/876247 (owner: 10Ahmon Dancy) [18:54:36] (03CR) 10Dzahn: [C: 03+2] docker::gc: fix syntax for volume-filter [puppet] - 10https://gerrit.wikimedia.org/r/876247 (owner: 10Ahmon Dancy) [18:54:57] (03CR) 10Dzahn: [V: 03+2 C: 03+2] docker::gc: fix syntax for volume-filter [puppet] - 10https://gerrit.wikimedia.org/r/876247 (owner: 10Ahmon Dancy) [18:56:53] !log gitlab-runner1002 - systemctl start docker-gc; run puppet on all gitlab-runners T310593 [18:56:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:56:57] T310593: Experiencing pipeline failure due to disk-space issues - https://phabricator.wikimedia.org/T310593 [18:57:02] RECOVERY - Check systemd state on gitlab-runner1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:57:04] RECOVERY - Check systemd state on gitlab-runner2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:57:24] RECOVERY - Check systemd state on gitlab-runner1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:57:28] RECOVERY - Check systemd state on gitlab-runner2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:57:30] RECOVERY - Check systemd state on gitlab-runner1004 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:57:42] !log systemctl start docker-gc on all gitlab-runners via cumin T310593 [18:57:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:57:48] dancy: all good [18:57:57] sukhe: fixed [18:58:07] Great. Sorry for the noise [18:58:15] no problem at all [18:58:34] thanks all <3 [18:59:13] (03CR) 10Dzahn: [C: 03+2] "fixed by https://gerrit.wikimedia.org/r/c/operations/puppet/+/876247" [puppet] - 10https://gerrit.wikimedia.org/r/876240 (https://phabricator.wikimedia.org/T326378) (owner: 10Ahmon Dancy) [19:00:38] (03CR) 10Raymond Ndibe: [C: 03+2] tools-webservice: read buildservice_repository from webservice.yaml config file [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/867910 (https://phabricator.wikimedia.org/T323689) (owner: 10Raymond Ndibe) [19:02:18] (03Merged) 10jenkins-bot: tools-webservice: read buildservice_repository from webservice.yaml config file [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/867910 (https://phabricator.wikimedia.org/T323689) (owner: 10Raymond Ndibe) [19:05:32] (03CR) 10Dzahn: ": parameter 'id' expects a Systemd::Sysuser::Id = Variant[Integer[0], Enum['-'], Stdlib::Unixpath = Pattern[/\A\/([^\n\/\0]+\/*)*\z/], Pat" [puppet] - 10https://gerrit.wikimedia.org/r/865207 (owner: 10Dzahn) [19:13:42] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [19:15:18] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [19:16:07] (03PS1) 10Southparkfan: rsyslog: allow subject name validation [puppet] - 10https://gerrit.wikimedia.org/r/876248 (https://phabricator.wikimedia.org/T127717) [19:19:45] (03PS5) 10Dzahn: phabricator: use systemd::sysuser to create vcs user [puppet] - 10https://gerrit.wikimedia.org/r/865207 [19:21:38] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [19:23:12] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [19:24:02] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 101 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:25:01] * sukhe sings a song to bring it down to lt 100 [19:25:38] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 41 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:27:23] fixed by (wiki whisperer) sukhe [19:28:12] (03CR) 10Dzahn: [C: 04-1] "parameter 'additional_groups' expects an Array value MEEEP" [puppet] - 10https://gerrit.wikimedia.org/r/865207 (owner: 10Dzahn) [19:28:44] (03PS6) 10Dzahn: phabricator: use systemd::sysuser to create vcs user [puppet] - 10https://gerrit.wikimedia.org/r/865207 [19:30:20] :D [19:34:37] (03PS1) 10Ottomata: Update flink-kubernetes-operator chart with upstream changes for 1.3.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/876249 (https://phabricator.wikimedia.org/T316519) [19:43:39] (03PS2) 10Ottomata: Update flink-kubernetes-operator chart with upstream changes for 1.3.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/876249 (https://phabricator.wikimedia.org/T316519) [19:45:44] (03PS1) 10Southparkfan: profile::base: fix hiera key name fox tls_client_auth [puppet] - 10https://gerrit.wikimedia.org/r/876251 (https://phabricator.wikimedia.org/T127717) [19:46:25] (03PS2) 10Southparkfan: profile::base: fix hiera key name for tls_client_auth [puppet] - 
10https://gerrit.wikimedia.org/r/876251 (https://phabricator.wikimedia.org/T127717) [19:47:11] (03CR) 10Southparkfan: rsyslog: allow subject name validation (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/876248 (https://phabricator.wikimedia.org/T127717) (owner: 10Southparkfan) [19:47:56] (03CR) 10Southparkfan: rsyslog: allow subject name validation (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/876248 (https://phabricator.wikimedia.org/T127717) (owner: 10Southparkfan) [19:50:37] (03PS16) 10Ottomata: flink and flink-kubernetes-operator image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/858356 (https://phabricator.wikimedia.org/T316519) [19:51:08] (03CR) 10Ottomata: flink and flink-kubernetes-operator image (036 comments) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/858356 (https://phabricator.wikimedia.org/T316519) (owner: 10Ottomata) [19:51:50] (03CR) 10Ottomata: flink and flink-kubernetes-operator image (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/858356 (https://phabricator.wikimedia.org/T316519) (owner: 10Ottomata) [19:57:38] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [19:59:10] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [20:05:42] (03PS27) 10Ottomata: flink-app chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/866510 (https://phabricator.wikimedia.org/T324576) [20:10:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:13:36] (03CR) 10Ottomata: flink-app chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/866510 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) [20:15:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:26:01] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1006-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [21:52:28] (03CR) 10Dzahn: [C: 03+2] "well, now we have a range up to 499 and one starting at 1000 but the first one I wanted to use it with, phd, is 920." 
[puppet] - 10https://gerrit.wikimedia.org/r/875446 (owner: 10Dzahn) [23:03:44] RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:15:36] (03PS9) 10Krinkle: Relax CSP rules for taint-check-demo [puppet] - 10https://gerrit.wikimedia.org/r/680337 (https://phabricator.wikimedia.org/T257301) (owner: 10Daimona Eaytoy) [23:19:58] (03PS10) 10Krinkle: doc: Relax CSP rules for taint-check-demo [puppet] - 10https://gerrit.wikimedia.org/r/680337 (https://phabricator.wikimedia.org/T257301) (owner: 10Daimona Eaytoy) [23:21:48] (03PS11) 10Krinkle: doc: Relax CSP rules for taint-check-demo [puppet] - 10https://gerrit.wikimedia.org/r/680337 (https://phabricator.wikimedia.org/T257301) (owner: 10Daimona Eaytoy) [23:21:57] (03CR) 10Krinkle: [C: 03+1] doc: Relax CSP rules for taint-check-demo [puppet] - 10https://gerrit.wikimedia.org/r/680337 (https://phabricator.wikimedia.org/T257301) (owner: 10Daimona Eaytoy) [23:23:50] (03CR) 10Krinkle: [C: 03+1] "Scheduled for https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230110T1700" [puppet] - 10https://gerrit.wikimedia.org/r/680337 (https://phabricator.wikimedia.org/T257301) (owner: 10Daimona Eaytoy) [23:23:55] (03CR) 10Krinkle: "Scheduled for https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230110T1700" [puppet] - 10https://gerrit.wikimedia.org/r/817409 (https://phabricator.wikimedia.org/T313881) (owner: 10Krinkle) [23:29:42] (03CR) 10Krinkle: [C: 03+1] doc: Relax CSP rules for taint-check-demo (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/680337 (https://phabricator.wikimedia.org/T257301) (owner: 10Daimona Eaytoy)