[00:10:50] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P42926 and previous config saved to /var/cache/conftool/dbconfig/20230106-001049-ladsgroup.json
[00:18:58] <icinga-wm>	 PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: hardsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:25:56] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P42927 and previous config saved to /var/cache/conftool/dbconfig/20230106-002556-ladsgroup.json
[00:27:25] <wikibugs>	 (PS1) Andrew Bogott: Neutron: enable linuxbridge for Zed [puppet] - https://gerrit.wikimedia.org/r/876033 (https://phabricator.wikimedia.org/T323086)
[00:28:37] <wikibugs>	 (CR) Andrew Bogott: [C: +2] Neutron: enable linuxbridge for Zed [puppet] - https://gerrit.wikimedia.org/r/876033 (https://phabricator.wikimedia.org/T323086) (owner: Andrew Bogott)
[00:29:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (UPDATE certificaterequests) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[00:30:14] <icinga-wm>	 RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:34:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (UPDATE certificaterequests) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[00:41:03] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T326156)', diff saved to https://phabricator.wikimedia.org/P42928 and previous config saved to /var/cache/conftool/dbconfig/20230106-004102-ladsgroup.json
[00:41:06] <stashbot>	 T326156: Fix CreditsSource drifts - https://phabricator.wikimedia.org/T326156
[00:46:42] <icinga-wm>	 PROBLEM - Check systemd state on logstash2026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:52:27] <urbanecm>	 jouncebot: nowandnext
[00:52:27] <jouncebot>	 No deployments scheduled for the next 6 hour(s) and 7 minute(s)
[00:52:27] <jouncebot>	 In 6 hour(s) and 7 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230106T0700)
[00:55:04] <wikibugs>	 (PS1) Urbanecm: Revert "GlobalRename: Convert DB selects to use SelectQueryBuilder" [extensions/CentralAuth] (wmf/1.40.0-wmf.17) - https://gerrit.wikimedia.org/r/876051 (https://phabricator.wikimedia.org/T326377)
[00:55:35] <wikibugs>	 (CR) Urbanecm: [C: +2] "backporting; making Special:GlobalRenameProgress work again" [extensions/CentralAuth] (wmf/1.40.0-wmf.17) - https://gerrit.wikimedia.org/r/876051 (https://phabricator.wikimedia.org/T326377) (owner: Urbanecm)
[00:58:19] <wikibugs>	 (Merged) jenkins-bot: Revert "GlobalRename: Convert DB selects to use SelectQueryBuilder" [extensions/CentralAuth] (wmf/1.40.0-wmf.17) - https://gerrit.wikimedia.org/r/876051 (https://phabricator.wikimedia.org/T326377) (owner: Urbanecm)
[00:59:14] <logmsgbot>	 !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:876051|Revert "GlobalRename: Convert DB selects to use SelectQueryBuilder" (T326377 T312394)]]
[00:59:21] <stashbot>	 T326377: Special:GlobalRenameProgress fails with "Wikimedia\Rdbms\DBQueryError: Error 1146: Table 'metawiki.renameuser_status' doesn't exist" - https://phabricator.wikimedia.org/T326377
[00:59:22] <stashbot>	 T312394: Migrate usage of Database::select to SelectQueryBuilder in CentralAuth - https://phabricator.wikimedia.org/T312394
[01:01:01] <logmsgbot>	 !log urbanecm@deploy1002 urbanecm and urbanecm: Backport for [[gerrit:876051|Revert "GlobalRename: Convert DB selects to use SelectQueryBuilder" (T326377 T312394)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet
[01:01:45] <urbanecm>	 https://meta.wikimedia.org/wiki/Special:GlobalRenameProgress no longer throws a DB error at the debug server, proceeding
[01:02:08] <zabe>	 I literally don't understand why.
[01:02:54] <urbanecm>	 zabe: me neither. i'd suggest a fix, but it looks like a mystery at this point. i don't want to keep it broken and annoy the renamers, so...reverting for now :)
[01:03:30] <zabe>	 sure
[01:08:03] <logmsgbot>	 !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:876051|Revert "GlobalRename: Convert DB selects to use SelectQueryBuilder" (T326377 T312394)]] (duration: 08m 48s)
[01:08:07] <stashbot>	 T326377: Special:GlobalRenameProgress fails with "Wikimedia\Rdbms\DBQueryError: Error 1146: Table 'metawiki.renameuser_status' doesn't exist" - https://phabricator.wikimedia.org/T326377
[01:08:07] <stashbot>	 T312394: Migrate usage of Database::select to SelectQueryBuilder in CentralAuth - https://phabricator.wikimedia.org/T312394
[01:13:54] * urbanecm leaves the fix in wmf.17-only (it doesn't pass CI) and records it as next week's train blocker
[01:15:34] <zabe>	 If I had to guess, I would say it is a problem in rdbms, rather than a problem with centralauth.
[01:16:38] <urbanecm>	 that's my guess too. but I have no idea why copy pasting identical code into shell.php works, and calling the method doesn't :-/
[01:26:01] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1006-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[01:42:46] <jinxer-wm>	 (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:57:46] <jinxer-wm>	 (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:07:46] <jinxer-wm>	 (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:17:46] <jinxer-wm>	 (JobUnavailable) resolved: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:30:48] <icinga-wm>	 PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:31:18] <icinga-wm>	 PROBLEM - Check unit status of httpbb_hourly_appserver on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[02:51:46] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:52:28] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:53:52] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:55:18] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 20 Feb 2023 05:31:14 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:58:04] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49419 bytes in 0.061 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:58:46] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.269 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:27:46] <icinga-wm>	 RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:34:46] <icinga-wm>	 RECOVERY - Check unit status of httpbb_hourly_appserver on cumin2002 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[04:32:54] <icinga-wm>	 PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[04:34:28] <icinga-wm>	 RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST
[05:22:14] <icinga-wm>	 PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[05:23:50] <icinga-wm>	 RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST
[05:26:01] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1006-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[06:47:30] <icinga-wm>	 PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) is CRITICAL: Test Zotero and citoid alive returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid
[06:48:58] <icinga-wm>	 RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[07:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230106T0700)
[07:58:26] <wikibugs>	 (PS1) Ayounsi: Revert "Revert "drmrs offload Vodafone from Tata"" [homer/public] - https://gerrit.wikimedia.org/r/876052 (https://phabricator.wikimedia.org/T324955)
[08:00:04] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230106T0800)
[08:03:07] <wikibugs>	 (CR) Ayounsi: [C: +2] Revert "Revert "drmrs offload Vodafone from Tata"" [homer/public] - https://gerrit.wikimedia.org/r/876052 (https://phabricator.wikimedia.org/T324955) (owner: Ayounsi)
[08:03:42] <wikibugs>	 (Merged) jenkins-bot: Revert "Revert "drmrs offload Vodafone from Tata"" [homer/public] - https://gerrit.wikimedia.org/r/876052 (https://phabricator.wikimedia.org/T324955) (owner: Ayounsi)
[08:05:56] <XioNoX>	 !log drmrs offload Vodafone from Tata - T324955
[08:05:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:20:57] <icinga-wm>	 PROBLEM - SSH on an-launcher1002 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:22:33] <icinga-wm>	 RECOVERY - SSH on an-launcher1002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:37:41] <XioNoX>	 I'm going to do another (last for a bit) round of mass peering request emails to 37 interesting DE-CIX Marseille peers (not contacted yet), it's going to cause some noise in here, please ignore
[08:39:56] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 16347
[08:40:16] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 16347
[08:40:17] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 42473
[08:41:09] <wikibugs>	 (CR) Hashar: [C: +1] phabricator: use systemd::sysuser to create vcs user [puppet] - https://gerrit.wikimedia.org/r/865207 (owner: Dzahn)
[08:41:14] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 42473
[08:41:15] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 132602
[08:42:28] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 132602
[08:42:29] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 35432
[08:43:25] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 35432
[08:43:26] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 51254
[08:44:25] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 51254
[08:44:26] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 58715
[08:45:08] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 58715
[08:45:09] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 22822
[08:47:38] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 22822
[08:47:39] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 47794
[08:47:52] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 47794
[08:47:52] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 48237
[08:49:05] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 48237
[08:49:06] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 39405
[08:49:41] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 39405
[08:49:42] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 21320
[08:50:11] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 21320
[08:50:12] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 61573
[08:50:25] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 61573
[08:50:26] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 41095
[08:51:58] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 41095
[08:51:59] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 13113
[08:52:09] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 13113
[08:52:10] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 37558
[08:52:23] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 37558
[08:52:24] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 37282
[08:52:45] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 37282
[08:52:46] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 21245
[08:53:08] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 21245
[08:53:09] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 56630
[08:53:33] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 56630
[08:53:34] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 327700
[08:53:44] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 327700
[08:53:45] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 62597
[08:54:46] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 62597
[08:54:47] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 201746
[08:55:13] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 201746
[08:55:13] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 51185
[08:55:38] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 51185
[08:55:39] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 263237
[08:55:58] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 263237
[08:55:59] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 64049
[08:57:12] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 64049
[08:57:13] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 9119
[08:57:25] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 9119
[08:57:26] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 24482
[08:59:27] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 24482
[08:59:28] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 45489
[09:00:26] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 45489
[09:00:27] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 58717
[09:00:41] <icinga-wm>	 PROBLEM - SSH on an-launcher1002 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:01:17] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 58717
[09:01:18] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 60427
[09:01:37] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 60427
[09:01:38] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 15954
[09:01:52] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 15954
[09:01:53] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 32035
[09:02:13] <icinga-wm>	 RECOVERY - SSH on an-launcher1002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:02:18] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 32035
[09:02:19] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 4788
[09:03:42] <logmsgbot>	 !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.network.peering (exit_code=99) with action 'email' for AS: 4788
[09:03:43] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 37473
[09:04:53] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 37473
[09:04:54] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 5713
[09:05:21] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 5713
[09:05:22] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 9038
[09:05:41] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 9038
[09:05:42] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 266925
[09:06:24] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 266925
[09:06:25] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 36994
[09:06:59] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 36994
[09:08:19] <icinga-wm>	 PROBLEM - SSH on an-launcher1002 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:09:49] <icinga-wm>	 RECOVERY - SSH on an-launcher1002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:14:01] <icinga-wm>	 PROBLEM - Check systemd state on snapshot1008 is CRITICAL: CRITICAL - degraded: The following units failed: xlation-dumps.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:26:01] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1006-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[09:28:17] <icinga-wm>	 PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[09:32:17] <hashar>	 that alert is still broken ;)
[09:33:03] <icinga-wm>	 PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[09:34:37] <icinga-wm>	 RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST
[09:42:37] <icinga-wm>	 PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[09:44:13] <icinga-wm>	 RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST
[09:50:21] <wikibugs>	 (PS1) Jelto: gitlab_runner: lower docker_gc watermarks in wmcs [puppet] - https://gerrit.wikimedia.org/r/876184 (https://phabricator.wikimedia.org/T326378)
[10:01:47] <wikibugs>	 (CR) Jelto: "It seems the docker::gc job is not doing any cleanup due to quite high watermarks. I lowered the watermarks to start cleanup a little earl" [puppet] - https://gerrit.wikimedia.org/r/876184 (https://phabricator.wikimedia.org/T326378) (owner: Jelto)
[10:10:07] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 21245
[10:10:55] <logmsgbot>	 !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.network.peering (exit_code=99) with action 'configure' for AS: 21245
[10:32:00] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to deployment for Zabe - https://phabricator.wikimedia.org/T326327 (Volans)
[10:32:45] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to deployment for Zabe - https://phabricator.wikimedia.org/T326327 (Volans) p:Triage→Medium
[10:38:17] <icinga-wm>	 PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[10:39:47] <wikibugs>	 SRE, serviceops-collab: gerrit1003 service implementation task - https://phabricator.wikimedia.org/T326368 (LSobanski) Open→Stalled p:Triage→Medium
[10:39:49] <wikibugs>	 SRE, ops-eqiad, DC-Ops, serviceops-collab: Q3:rack/setup/install gerrit1003 - https://phabricator.wikimedia.org/T326366 (LSobanski)
[10:39:50] <wikibugs>	 (CR) Jbond: [C: -1] "lgtm other than the issue highlighted" [puppet] - https://gerrit.wikimedia.org/r/875897 (owner: Effie Mouzeli)
[10:43:01] <icinga-wm>	 PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[10:44:33] <icinga-wm>	 RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST
[10:45:15] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to deployment for Zabe - https://phabricator.wikimedia.org/T326327 (Urbanecm) This has my +1, Zabe's deployment access would help him in his work in many areas. Thanks for volunteering!
[10:47:59] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 159 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[10:49:05] <hashar>	 looking
[10:49:35] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 43 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[10:50:53] <icinga-wm>	 PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[10:50:57] <claime>	 lots of Error: Cannot use object of type ContentTranslation\DTO\TranslationUnitDTO as array
[10:52:29] <icinga-wm>	 RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST
[10:59:19] <hashar>	 one of the spikes was `[{reqId}] {exception_url} Wikimedia\Rdbms\DBQueryTimeoutError: A database query timeout has occurred. Query: SET STATEMENT max_statement_time=30 FOR SELECT /*! STRAIGHT_JOIN */ actor_name,actor_user,rc_actor,rc_id,rc_timestamp,rc_namespace,rc_title,rc`
[10:59:40] <hashar>	 some others were Parsoid timing out for some pages on enwiki
[10:59:54] <hashar>	 they don't seem to be too problematic
[11:00:44] <wikibugs>	 (03CR) 10Jbond: phabricator: change phd home dir to /var/lib/phd (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/875266 (https://phabricator.wikimedia.org/T326146) (owner: 10Hashar)
[11:01:27] <wikibugs>	 10SRE-OnFire, 10Data-Engineering-Planning, 10serviceops, 10Patch-For-Review, 10Sustainability (Incident Followup): Incident: 2022-12-09 api appserver worker starvation - https://phabricator.wikimedia.org/T324994 (10EChetty)
[11:02:15] <wikibugs>	 10SRE-OnFire, 10Data-Engineering-Planning, 10serviceops, 10Sustainability (Incident Followup): Uneven CPU throttling of eventgate-analytics under load - https://phabricator.wikimedia.org/T325068 (10EChetty)
[11:02:50] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] Rename alias [puppet] - 10https://gerrit.wikimedia.org/r/875971 (owner: 10Muehlenhoff)
[11:03:25] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm thanks" [puppet] - 10https://gerrit.wikimedia.org/r/875446 (owner: 10Dzahn)
[11:05:34] <wikibugs>	 (03CR) 10Jbond: "lgtm" [software/cumin] - 10https://gerrit.wikimedia.org/r/875985 (owner: 10Volans)
[11:06:20] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [software/cumin] - 10https://gerrit.wikimedia.org/r/875986 (owner: 10Volans)
[11:37:48] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db1130.eqiad.wmnet with reason: Maintenance
[11:38:01] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1130.eqiad.wmnet with reason: Maintenance
[11:38:10] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db2113.codfw.wmnet with reason: Maintenance
[11:38:23] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2113.codfw.wmnet with reason: Maintenance
[11:38:44] <jbond>	 !log upload bgpalerter to bullseye-wikimedia
[11:38:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:50:24] <wikibugs>	 (03PS1) 10Jbond: bgpalerter: manage installing package [puppet] - 10https://gerrit.wikimedia.org/r/876191
[11:51:01] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] bgpalerter: manage installing package [puppet] - 10https://gerrit.wikimedia.org/r/876191 (owner: 10Jbond)
[11:51:38] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] bgpalerter: manage installing package [puppet] - 10https://gerrit.wikimedia.org/r/876191 (owner: 10Jbond)
[11:52:53] <wikibugs>	 (03PS2) 10Jbond: bgpalerter: manage installing package [puppet] - 10https://gerrit.wikimedia.org/r/876191
[11:57:52] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] bgpalerter: manage installing package [puppet] - 10https://gerrit.wikimedia.org/r/876191 (owner: 10Jbond)
[12:04:47] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Kubernetes, 10Security: Network segmentation for WMF servers - https://phabricator.wikimedia.org/T101912 (10LSobanski)
[12:08:16] <wikibugs>	 10SRE, 10MediaWiki-Shell, 10WMF-General-or-Unknown, 10Security, 10Sustainability (Incident Followup): Securing external binaries run by MediaWiki - https://phabricator.wikimedia.org/T172584 (10LSobanski)
[12:14:57] <wikibugs>	 10SRE, 10Security: Network isolation for production and semi-production services - https://phabricator.wikimedia.org/T121240 (10LSobanski)
[12:15:18] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Kubernetes, 10Security: Network segmentation for WMF servers - https://phabricator.wikimedia.org/T101912 (10LSobanski)
[12:19:46] <wikibugs>	 (03PS2) 10Stevemunene: Bump up mediawiki_history_snapshot to 2022-12 [puppet] - 10https://gerrit.wikimedia.org/r/875364 (owner: 10Mforns)
[12:20:25] <icinga-wm>	 PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[12:21:00] <wikibugs>	 (03CR) 10Stevemunene: [C: 03+2] Bump up mediawiki_history_snapshot to 2022-12 [puppet] - 10https://gerrit.wikimedia.org/r/875364 (owner: 10Mforns)
[12:23:37] <icinga-wm>	 RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST
[12:28:33] <wikibugs>	 (03PS1) 10Jbond: bgpalerter: update binary path [puppet] - 10https://gerrit.wikimedia.org/r/876192
[12:29:04] <logmsgbot>	 !log stevemunene@cumin1001 START - Cookbook sre.aqs.roll-restart for AQS aqs cluster: Roll restart of all AQS's nodejs daemons.
[12:29:30] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] bgpalerter: update binary path [puppet] - 10https://gerrit.wikimedia.org/r/876192 (owner: 10Jbond)
[12:35:53] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10vm-requests: EQIAD: 1 VM request for idm-test - https://phabricator.wikimedia.org/T326406 (10SLyngshede-WMF)
[12:36:03] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10vm-requests: EQIAD: 1 VM request for idm-test - https://phabricator.wikimedia.org/T326406 (10SLyngshede-WMF) a:03SLyngshede-WMF
[12:36:14] <tzatziki>	 !log running extensions/SecurePoll/cli/wm-scripts/ucoc2023/ucoc2023_tables.sql on each wiki
[12:36:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:36:19] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10vm-requests: EQIAD: 1 VM request for idm-test - https://phabricator.wikimedia.org/T326406 (10SLyngshede-WMF) p:05Triage→03Low
[12:42:47] <logmsgbot>	 !log stevemunene@cumin1001 END (PASS) - Cookbook sre.aqs.roll-restart (exit_code=0) for AQS aqs cluster: Roll restart of all AQS's nodejs daemons.
[12:49:30] <wikibugs>	 10SRE-OnFire, 10Data-Engineering-Planning, 10Event-Platform Value Stream, 10serviceops, 10Sustainability (Incident Followup): Uneven CPU throttling of eventgate-analytics under load - https://phabricator.wikimedia.org/T325068 (10EChetty)
[12:50:14] <wikibugs>	 10SRE-OnFire, 10Data-Engineering-Planning, 10Event-Platform Value Stream, 10serviceops, and 2 others: Incident: 2022-12-09 api appserver worker starvation - https://phabricator.wikimedia.org/T324994 (10EChetty)
[12:59:40] <wikibugs>	 10SRE: docker-registry.wikimedia.org/golang:1.11 should no more depends on stretch-backports - https://phabricator.wikimedia.org/T261920 (10LSobanski) 05Open→03Resolved a:03LSobanski Based on P11925#75528 and the fact that there is a Buster version of golang 1.11 (1.11.6-1+deb10u4), I think this can be res...
[12:59:47] <wikibugs>	 10SRE, 10Patch-For-Review: Handle sunset of stretch-backports - https://phabricator.wikimedia.org/T256877 (10LSobanski)
[13:02:25] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10User-MoritzMuehlenhoff: Automated removal of obsolete kernels - https://phabricator.wikimedia.org/T277011 (10LSobanski)
[13:04:33] <wikibugs>	 10SRE, 10Release Pipeline, 10serviceops, 10Epic, 10Release-Engineering-Team (Seen): Migrate production services to kubernetes using the pipeline - https://phabricator.wikimedia.org/T198901 (10LSobanski)
[13:08:51] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10User-MoritzMuehlenhoff: Migrate remaining services using Java to profile::java - https://phabricator.wikimedia.org/T264174 (10LSobanski)
[13:17:39] <wikibugs>	 10SRE, 10Commons, 10MediaWiki-File-management, 10StructuredDataOnCommons, and 2 others: Frequent "Error: 429, Too Many Requests" errors on pages with many (>50) thumbnails - https://phabricator.wikimedia.org/T266155 (10LSobanski)
[13:17:58] <wikibugs>	 10SRE, 10TimedMediaHandler-Transcode: Increase job runners on video scalers to maximize load efficiency - https://phabricator.wikimedia.org/T201358 (10LSobanski) 05Open→03Resolved a:03LSobanski
[13:19:19] <wikibugs>	 (03PS4) 10Jelto: gitlab: stop using "latest" backup name [puppet] - 10https://gerrit.wikimedia.org/r/875309 (https://phabricator.wikimedia.org/T274463)
[13:20:22] <wikibugs>	 10SRE, 10DBA, 10MediaWiki-libs-Rdbms, 10Patch-For-Review, 10Performance-Team (Radar): Check if setBigSelects() is still needed - https://phabricator.wikimedia.org/T325610 (10LSobanski)
[13:26:01] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1006-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[13:33:29] <wikibugs>	 (03CR) 10David Caro: [C: 03+1] "Got a question there, but looks ok" [puppet] - 10https://gerrit.wikimedia.org/r/874813 (https://phabricator.wikimedia.org/T317478) (owner: 10Majavah)
[13:34:57] <wikibugs>	 (03CR) 10Majavah: openstack: encapi: open up write access (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/874813 (https://phabricator.wikimedia.org/T317478) (owner: 10Majavah)
[13:51:59] <wikibugs>	 (03PS5) 10Jelto: gitlab: stop using "latest" backup name [puppet] - 10https://gerrit.wikimedia.org/r/875309 (https://phabricator.wikimedia.org/T274463)
[13:53:42] <wikibugs>	 (03PS1) 10Stang: zhwiki: Install PageAssessments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/876196
[13:53:57] <wikibugs>	 (03PS1) 10Reedy: wm-scripts: Get Flow DB_REPLICA in a different way [extensions/SecurePoll] (wmf/1.40.0-wmf.17) - 10https://gerrit.wikimedia.org/r/876056 (https://phabricator.wikimedia.org/T326408)
[13:54:05] <wikibugs>	 (03CR) 10Reedy: [C: 03+2] wm-scripts: Get Flow DB_REPLICA in a different way [extensions/SecurePoll] (wmf/1.40.0-wmf.17) - 10https://gerrit.wikimedia.org/r/876056 (https://phabricator.wikimedia.org/T326408) (owner: 10Reedy)
[13:54:11] <wikibugs>	 (03PS2) 10Stang: zhwiki: Install PageAssessments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/876196 (https://phabricator.wikimedia.org/T326387)
[13:55:45] <wikibugs>	 (03CR) 10Reedy: [C: 03+2] wm-scripts: Get Flow DB_REPLICA in a different way [extensions/SecurePoll] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/876057 (https://phabricator.wikimedia.org/T326408) (owner: 10Reedy)
[13:56:51] <wikibugs>	 (03Merged) 10jenkins-bot: wm-scripts: Get Flow DB_REPLICA in a different way [extensions/SecurePoll] (wmf/1.40.0-wmf.17) - 10https://gerrit.wikimedia.org/r/876056 (https://phabricator.wikimedia.org/T326408) (owner: 10Reedy)
[13:57:26] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] wm-scripts: Get Flow DB_REPLICA in a different way [extensions/SecurePoll] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/876057 (https://phabricator.wikimedia.org/T326408) (owner: 10Reedy)
[13:57:51] <Reedy>	 really phan
[13:58:12] <wikibugs>	 (03Abandoned) 10Reedy: wm-scripts: Get Flow DB_REPLICA in a different way [extensions/SecurePoll] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/876057 (https://phabricator.wikimedia.org/T326408) (owner: 10Reedy)
[14:01:41] <icinga-wm>	 PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.130 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:05:15] <icinga-wm>	 PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.131 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:06:22] <logmsgbot>	 !log reedy@deploy1002 Synchronized php-1.40.0-wmf.17/extensions/SecurePoll/cli/wm-scripts/ucoc2023/populateEditCount.php: T326408 (duration: 07m 09s)
[14:06:25] <wikibugs>	 10SRE, 10Performance-Team (Radar): unwind the Puppetized /etc/hosts override of statsd.eqiad.wmnet - https://phabricator.wikimedia.org/T239862 (10LSobanski) @Joe's patch mentioned above has been merged in Feb 2021 and the hardcoded IP config has since been moved to monitoring.pp. @Cdanis can the entry be remov...
[14:06:25] <stashbot>	 T326408: Flow edit count isn't getting Flow database correctly - https://phabricator.wikimedia.org/T326408
[14:10:47] <icinga-wm>	 RECOVERY - Host parse1002 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms
[14:15:04] <wikibugs>	 (03PS28) 10Ottomata: flink-kubernetes-operator - modify for WMF [deployment-charts] - 10https://gerrit.wikimedia.org/r/865158 (https://phabricator.wikimedia.org/T324576)
[14:15:06] <wikibugs>	 (03PS1) 10Ottomata: flink-operator - add admin_ng helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/876200 (https://phabricator.wikimedia.org/T324576)
[14:15:34] <wikibugs>	 (03CR) 10Ottomata: "Already reviewed in I74ae11d8604be5bb5ce9cdb41c5e51aae38f4723" [deployment-charts] - 10https://gerrit.wikimedia.org/r/876200 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata)
[14:16:18] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] flink-kubernetes-operator - Initial commit of upstream helm chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/865100 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata)
[14:20:53] <wikibugs>	 (03CR) 10Hashar: phabricator: change phd home dir to /var/lib/phd (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/875266 (https://phabricator.wikimedia.org/T326146) (owner: 10Hashar)
[14:21:44] <wikibugs>	 (03Merged) 10jenkins-bot: flink-kubernetes-operator - Initial commit of upstream helm chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/865100 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata)
[14:23:45] <wikibugs>	 (03PS29) 10Ottomata: flink-kubernetes-operator - modify for WMF [deployment-charts] - 10https://gerrit.wikimedia.org/r/865158 (https://phabricator.wikimedia.org/T324576)
[14:23:56] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] flink-kubernetes-operator - modify for WMF [deployment-charts] - 10https://gerrit.wikimedia.org/r/865158 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata)
[14:24:04] <wikibugs>	 (03PS23) 10Ottomata: flink-app chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/866510 (https://phabricator.wikimedia.org/T324576)
[14:24:14] <wikibugs>	 (03PS2) 10Ottomata: flink-operator - add admin_ng helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/876200 (https://phabricator.wikimedia.org/T324576)
[14:24:57] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] flink-app chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/866510 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata)
[14:28:31] <wikibugs>	 (03Merged) 10jenkins-bot: flink-kubernetes-operator - modify for WMF [deployment-charts] - 10https://gerrit.wikimedia.org/r/865158 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata)
[14:28:49] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] flink-operator - add admin_ng helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/876200 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata)
[14:30:09] <wikibugs>	 (03PS1) 10Ottomata: Bump flink-kubernetes-operator chart versions to 1.3.0 to match image and upstream version [deployment-charts] - 10https://gerrit.wikimedia.org/r/876203 (https://phabricator.wikimedia.org/T324576)
[14:30:52] <wikibugs>	 (03PS2) 10Ottomata: Bump flink-kubernetes-operator chart versions to 1.3.0 to match image and upstream version [deployment-charts] - 10https://gerrit.wikimedia.org/r/876203 (https://phabricator.wikimedia.org/T324576)
[14:31:43] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] Bump flink-kubernetes-operator chart versions to 1.3.0 to match image and upstream version [deployment-charts] - 10https://gerrit.wikimedia.org/r/876203 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata)
[14:36:59] <wikibugs>	 (03Merged) 10jenkins-bot: Bump flink-kubernetes-operator chart versions to 1.3.0 to match image and upstream version [deployment-charts] - 10https://gerrit.wikimedia.org/r/876203 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata)
[14:38:49] <wikibugs>	 10ops-eqiad, 10DC-Ops, 10serviceops: hw troubleshooting:  CPU1 machine check error on parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T326119 (10Jclark-ctr) Sorry did not give update. Case# 159648923 was submitted 1/4/2023  Idrac was not reachable remotely.  Reset Idrac with crash cart 1/6/2023 TSR...
[14:38:59] <wikibugs>	 (03PS13) 10Hashar: httpd: add flag to wait for network-online.target [puppet] - 10https://gerrit.wikimedia.org/r/875314 (https://phabricator.wikimedia.org/T326125)
[14:39:08] <wikibugs>	 (03PS5) 10Hashar: gerrit: make Apache wait for network-online.target [puppet] - 10https://gerrit.wikimedia.org/r/875315 (https://phabricator.wikimedia.org/T326125)
[14:39:35] <wikibugs>	 10SRE, 10Traffic, 10Upstream: Review cp2041 and cp2042 running bullseye - https://phabricator.wikimedia.org/T325557 (10ssingh) Update from the maintainer: the package is no longer being maintained in Debian so we will build our own.  https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1027994#10  > On Fri, Jan...
[14:42:01] <wikibugs>	 (03CR) 10JMeybohm: flink and flink-kubernetes-operator image (0313 comments) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/858356 (https://phabricator.wikimedia.org/T316519) (owner: 10Ottomata)
[14:42:26] <jbond>	 !log remove bgpalerter from apt
[14:42:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:42:44] <wikibugs>	 (03PS24) 10JMeybohm: flink-app chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/866510 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata)
[14:43:34] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] flink-app chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/866510 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata)
[15:07:13] <sukhe>	 !log depool cp5032 for bullseye upgrade (starting with NIC firmware upgrade): T325797
[15:07:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:07:17] <stashbot>	 T325797: oom killed varnish on cp4052 - https://phabricator.wikimedia.org/T325797
[15:07:45] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp5032.eqsin.wmnet,service=cdn
[15:07:45] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp5032.eqsin.wmnet,service=ats-be
[15:08:11] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp5032.eqsin.wmnet
[15:08:43] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=False) upgrade firmware for hosts cp5032.eqsin.wmnet
[15:10:15] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp5032.eqsin.wmnet with OS bullseye
[15:10:21] <wikibugs>	 10SRE, 10Traffic: oom killed varnish on cp4052 - https://phabricator.wikimedia.org/T325797 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp5032.eqsin.wmnet with OS bullseye
[15:16:23] <wikibugs>	 (03PS1) 10Ssingh: hiera: cp5032: do not set use_linux510_on_buster [puppet] - 10https://gerrit.wikimedia.org/r/876206 (https://phabricator.wikimedia.org/T325797)
[15:17:41] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38989/console" [puppet] - 10https://gerrit.wikimedia.org/r/876206 (https://phabricator.wikimedia.org/T325797) (owner: 10Ssingh)
[15:18:28] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1 C: 03+2] hiera: cp5032: do not set use_linux510_on_buster [puppet] - 10https://gerrit.wikimedia.org/r/876206 (https://phabricator.wikimedia.org/T325797) (owner: 10Ssingh)
[15:24:27] <wikibugs>	 (03PS1) 10Ssingh: install_server: remove installation of linux-image-5.10-amd64 for cp[45]* [puppet] - 10https://gerrit.wikimedia.org/r/876207
[15:24:51] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38990/console" [puppet] - 10https://gerrit.wikimedia.org/r/876207 (owner: 10Ssingh)
[15:26:48] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] wmfdebug 0.0.6: Include the wmf-certificates package (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/875439 (owner: 10Ahmon Dancy)
[15:27:49] <wikibugs>	 (03PS2) 10Ssingh: install_server: remove installation of linux-image-5.10-amd64 for cp[45]* [puppet] - 10https://gerrit.wikimedia.org/r/876207
[15:28:48] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38991/console" [puppet] - 10https://gerrit.wikimedia.org/r/876207 (owner: 10Ssingh)
[15:30:43] <logmsgbot>	 !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp5032.eqsin.wmnet with OS bullseye
[15:30:47] <wikibugs>	 10SRE, 10Traffic: oom killed varnish on cp4052 - https://phabricator.wikimedia.org/T325797 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp5032.eqsin.wmnet with OS bullseye executed with errors: - cp5032 (**FAIL**)   - Downtimed on Icinga/Alertmanager   - Disab...
[15:30:51] <icinga-wm>	 PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:31:13] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp5032.eqsin.wmnet with OS bullseye
[15:31:18] <wikibugs>	 10SRE, 10Traffic: oom killed varnish on cp4052 - https://phabricator.wikimedia.org/T325797 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp5032.eqsin.wmnet with OS bullseye
[15:32:21] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[15:34:39] <wikibugs>	 (03PS20) 10Jbond: O:rpkivalidator: add bgpalerter to rpki servers [puppet] - 10https://gerrit.wikimedia.org/r/753450 (https://phabricator.wikimedia.org/T230600)
[15:35:31] <wikibugs>	 (03PS21) 10Jbond: O:rpkivalidator: add bgpalerter to rpki servers [puppet] - 10https://gerrit.wikimedia.org/r/753450 (https://phabricator.wikimedia.org/T230600)
[15:36:40] <wikibugs>	 (03PS22) 10Jbond: O:rpkivalidator: add bgpalerter to rpki servers [puppet] - 10https://gerrit.wikimedia.org/r/753450 (https://phabricator.wikimedia.org/T230600)
[15:37:32] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38994/console" [puppet] - 10https://gerrit.wikimedia.org/r/753450 (https://phabricator.wikimedia.org/T230600) (owner: 10Jbond)
[15:39:00] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] O:rpkivalidator: add bgpalerter to rpki servers [puppet] - 10https://gerrit.wikimedia.org/r/753450 (https://phabricator.wikimedia.org/T230600) (owner: 10Jbond)
[15:40:02] <wikibugs>	 (03PS23) 10Jbond: O:rpkivalidator: add bgpalerter to rpki servers [puppet] - 10https://gerrit.wikimedia.org/r/753450 (https://phabricator.wikimedia.org/T230600)
[15:40:04] <wikibugs>	 (03CR) 10Jbond: "rebased" [puppet] - 10https://gerrit.wikimedia.org/r/753450 (https://phabricator.wikimedia.org/T230600) (owner: 10Jbond)
[15:43:29] <wikibugs>	 (03PS1) 10Hashar: wm-checks-api: fix TypeScript noImplicitAny [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/876212
[15:43:41] <icinga-wm>	 PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.131 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:45:25] <icinga-wm>	 RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:52:59] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[15:54:02] <logmsgbot>	 !log cgoubert@cumin1001 conftool action : set/pooled=inactive; selector: name=mw1486.eqiad.wmnet
[15:58:06] <wikibugs>	 10ops-eqiad, 10DC-Ops, 10serviceops: hw troubleshooting: power consumption reboot failure for mw1486.eqiad.wmnet - https://phabricator.wikimedia.org/T326425 (10Clement_Goubert) p:05Triage→03Low
[15:58:45] <wikibugs>	 (03PS2) 10Ahmon Dancy: wmfdebug 0.0.6: Include the wmf-certificates package [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/875439
[15:58:57] <wikibugs>	 (03CR) 10Xcollazo: "(closing draft comments)" [puppet] - 10https://gerrit.wikimedia.org/r/824241 (https://phabricator.wikimedia.org/T312858) (owner: 10Xcollazo)
[15:59:02] <wikibugs>	 (03CR) 10Ahmon Dancy: wmfdebug 0.0.6: Include the wmf-certificates package (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/875439 (owner: 10Ahmon Dancy)
[16:00:54] <wikibugs>	 10ops-eqiad, 10DC-Ops, 10serviceops: hw troubleshooting: power consumption reboot failure for mw1486.eqiad.wmnet - https://phabricator.wikimedia.org/T326425 (10Clement_Goubert) racadm getsel log: ` ------------------------------------------------------------------------------- Record:      5 Date/Time:   01/...
[16:01:00] <wikibugs>	 10ops-eqiad, 10DC-Ops, 10serviceops: hw troubleshooting: power consumption reboot failure for mw1486.eqiad.wmnet - https://phabricator.wikimedia.org/T326425 (10Clement_Goubert)
[16:02:11] <wikibugs>	 (03CR) 10JMeybohm: flink-operator - add admin_ng helmfile (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/876200 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata)
[16:02:54] <wikibugs>	 (03PS3) 10JMeybohm: flink-operator - add admin_ng helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/876200 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata)
[16:03:09] <wikibugs>	 (03PS1) 10Bking: [WIP] wdqs-data-reload: use NFS for data reloads [cookbooks] - 10https://gerrit.wikimedia.org/r/876217 (https://phabricator.wikimedia.org/T323096)
[16:03:29] <wikibugs>	 (03CR) 10JMeybohm: flink-operator - add admin_ng helmfile (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/876200 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata)
[16:04:20] <wikibugs>	 10ops-eqiad, 10DC-Ops, 10serviceops: hw troubleshooting:  CPU1 machine check error on parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T326119 (10Clement_Goubert) Thanks for the update. I will extend the downtime to two weeks from now, will revisit if necessary.
[16:05:44] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 14 days, 0:00:00 on parse1002.eqiad.wmnet with reason: CPU1 machine check error
[16:05:47] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on parse1002.eqiad.wmnet with reason: CPU1 machine check error
[16:05:50] <wikibugs>	 10ops-eqiad, 10DC-Ops, 10serviceops: hw troubleshooting:  CPU1 machine check error on parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T326119 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=5c4b686a-9560-44c1-acb3-c16978d72b37) set by cgoubert@cumin1001 for 14 days, 0:00:00 on 1...
[16:06:23] <wikibugs>	 (03CR) 10Xcollazo: [C: 03+1] "Now that we have conda analytics deployed, and there is a user deadline for moving from anaconda-wmf, do we still need this patch or can w" [debs/anaconda-wmf] (debian) - 10https://gerrit.wikimedia.org/r/780898 (https://phabricator.wikimedia.org/T306197) (owner: 10Ottomata)
[16:10:34] <wikibugs>	 (03CR) 10Ottomata: flink and flink-kubernetes-operator image (0314 comments) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/858356 (https://phabricator.wikimedia.org/T316519) (owner: 10Ottomata)
[16:12:00] <wikibugs>	 (03PS15) 10Ottomata: flink and flink-kubernetes-operator image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/858356 (https://phabricator.wikimedia.org/T316519)
[16:14:56] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/876200 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata)
[16:15:22] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] wmfdebug 0.0.6: Include the wmf-certificates package [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/875439 (owner: 10Ahmon Dancy)
[16:15:46] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1] "Execution of preseeded command "wget -O /tmp/late_command           │" [puppet] - 10https://gerrit.wikimedia.org/r/876207 (owner: 10Ssingh)
[16:16:49] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1 C: 03+2] install_server: remove installation of linux-image-5.10-amd64 for cp[45]* [puppet] - 10https://gerrit.wikimedia.org/r/876207 (owner: 10Ssingh)
[16:17:28] <wikibugs>	 (03CR) 10Ottomata: "Interesting, because I don't include all of the expected scaffold templates, some of the test case fixtures are failing." [deployment-charts] - 10https://gerrit.wikimedia.org/r/866510 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata)
[16:18:01] <logmsgbot>	 !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp5032.eqsin.wmnet with OS bullseye
[16:18:06] <wikibugs>	 10SRE, 10Traffic: oom killed varnish on cp4052 - https://phabricator.wikimedia.org/T325797 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp5032.eqsin.wmnet with OS bullseye executed with errors: - cp5032 (**FAIL**)   - Removed from Puppet and PuppetDB if presen...
[16:21:02] <wikibugs>	 (03PS4) 10Ottomata: flink-operator - add admin_ng helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/876200 (https://phabricator.wikimedia.org/T324576)
[16:21:16] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] flink-operator - add admin_ng helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/876200 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata)
[16:26:02] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp5032.eqsin.wmnet with OS bullseye
[16:26:08] <wikibugs>	 10SRE, 10Traffic: oom killed varnish on cp4052 - https://phabricator.wikimedia.org/T325797 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp5032.eqsin.wmnet with OS bullseye
[16:29:55] <wikibugs>	 (03PS5) 10Ottomata: flink-operator - add admin_ng helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/876200 (https://phabricator.wikimedia.org/T324576)
[16:32:42] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] admin: add data types to validate UIDs [puppet] - 10https://gerrit.wikimedia.org/r/875446 (owner: 10Dzahn)
[16:33:00] <wikibugs>	 (03PS25) 10Ottomata: flink-app chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/866510 (https://phabricator.wikimedia.org/T324576)
[16:33:56] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] flink-app chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/866510 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata)
[16:34:35] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] "nit: please add what the actual problem was that was fixed, but lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/869717 (owner: 10AOkoth)
[16:36:39] <wikibugs>	 (03PS1) 10MVernon: thanos: drain thanos-be[1,2]004 [puppet] - 10https://gerrit.wikimedia.org/r/876221 (https://phabricator.wikimedia.org/T279621)
[16:38:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:38:16] <wikibugs>	 (03CR) 10MVernon: "Hi," [puppet] - 10https://gerrit.wikimedia.org/r/876221 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon)
[16:43:03] <jinxer-wm>	 (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:43:51] <wikibugs>	 (CR) JMeybohm: flink and flink-kubernetes-operator image (6 comments) [docker-images/production-images] - https://gerrit.wikimedia.org/r/858356 (https://phabricator.wikimedia.org/T316519) (owner: Ottomata)
[16:48:13] <icinga-wm>	 PROBLEM - mediawiki-installation DSH group on mw1486 is CRITICAL: Host mw1486 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[16:53:25] <logmsgbot>	 !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp5032.eqsin.wmnet with OS bullseye
[16:53:29] <wikibugs>	 10SRE, 10Traffic: oom killed varnish on cp4052 - https://phabricator.wikimedia.org/T325797 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp5032.eqsin.wmnet with OS bullseye executed with errors: - cp5032 (**FAIL**)   - Removed from Puppet and PuppetDB if presen...
[16:53:41] <wikibugs>	 (03PS26) 10Ottomata: flink-app chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/866510 (https://phabricator.wikimedia.org/T324576)
[16:53:53] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] flink-app chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/866510 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata)
[16:53:54] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp5032.eqsin.wmnet with OS bullseye
[16:58:24] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: hw troubleshooting: power consumption reboot failure for mw1486.eqiad.wmnet - https://phabricator.wikimedia.org/T326425 (10Jclark-ctr) Thank you for deploying will investigate today while on site
[17:06:09] <icinga-wm>	 PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:15:48] <wikibugs>	 (03Abandoned) 10Ottomata: Actually set REQUESTS_CA_BUNDLE [debs/anaconda-wmf] (debian) - 10https://gerrit.wikimedia.org/r/780898 (https://phabricator.wikimedia.org/T306197) (owner: 10Ottomata)
[17:19:33] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] gitlab_runner: lower docker_gc watermarks in wmcs [puppet] - 10https://gerrit.wikimedia.org/r/876184 (https://phabricator.wikimedia.org/T326378) (owner: 10Jelto)
[17:25:25] <wikibugs>	 (03PS1) 10Vlad.shapik: WIP: Update Thumbor repository according to the latest changes [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/876229 (https://phabricator.wikimedia.org/T325811)
[17:26:01] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1006-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[17:26:20] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp5032.eqsin.wmnet with reason: host reimage
[17:29:29] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5032.eqsin.wmnet with reason: host reimage
[17:29:54] <wikibugs>	 (03CR) 10Dzahn: [C: 04-1] "Systemd::Sysuser[vcs]: has no parameter named 'gid' - https://puppet-compiler.wmflabs.org/output/865207/38996/phab1004.eqiad.wmnet/change." [puppet] - 10https://gerrit.wikimedia.org/r/865207 (owner: 10Dzahn)
[17:42:02] <wikibugs>	 (03PS4) 10Dzahn: phabricator: use systemd::sysuser to create vcs user [puppet] - 10https://gerrit.wikimedia.org/r/865207
[17:48:47] <wikibugs>	 (03PS1) 10Btullis: Detect the correct disks for the O/S on the cephosd servers [puppet] - 10https://gerrit.wikimedia.org/r/876237 (https://phabricator.wikimedia.org/T324670)
[17:54:10] <wikibugs>	 (03CR) 10Ahmon Dancy: "There's more to be done to handle the runner-*-concurrent-2-cache-* volumes.  I'll work on a separate commit for that." [puppet] - 10https://gerrit.wikimedia.org/r/876184 (https://phabricator.wikimedia.org/T326378) (owner: 10Jelto)
[17:58:07] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp5032.eqsin.wmnet with OS bullseye
[18:00:11] <wikibugs>	 (03PS7) 10Raymond Ndibe: tools-webservice: read buildservice_repository from webservice.yaml config file [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/867910 (https://phabricator.wikimedia.org/T323689)
[18:01:02] <wikibugs>	 (CR) Raymond Ndibe: tools-webservice: read buildservice_repository from webservice.yaml config file (1 comment) [software/tools-webservice] - https://gerrit.wikimedia.org/r/867910 (https://phabricator.wikimedia.org/T323689) (owner: Raymond Ndibe)
[18:01:20] <wikibugs>	 (03CR) 10Ahmon Dancy: [C: 03+1] gitlab_runner: lower docker_gc watermarks in wmcs [puppet] - 10https://gerrit.wikimedia.org/r/876184 (https://phabricator.wikimedia.org/T326378) (owner: 10Jelto)
[18:05:42] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] gitlab_runner: lower docker_gc watermarks in wmcs [puppet] - 10https://gerrit.wikimedia.org/r/876184 (https://phabricator.wikimedia.org/T326378) (owner: 10Jelto)
[18:06:36] <wikibugs>	 (CR) Dzahn: [C: +2] gitlab_runner: lower docker_gc watermarks in wmcs (1 comment) [puppet] - https://gerrit.wikimedia.org/r/876184 (https://phabricator.wikimedia.org/T326378) (owner: Jelto)
[18:06:58] <wikibugs>	 (03PS1) 10Ahmon Dancy: Make gitlab-runner cache volumes eligible for docker-gc [puppet] - 10https://gerrit.wikimedia.org/r/876240 (https://phabricator.wikimedia.org/T326378)
[18:13:25] <Krinkle>	 !log krinkle@cloudweb1003$ Run `UPDATE actor SET actor_user=31136 WHERE actor_id=14640;` to partially fix T326431
[18:13:25] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: hw troubleshooting: power consumption reboot failure for mw1486.eqiad.wmnet - https://phabricator.wikimedia.org/T326425 (10Jclark-ctr) Created ticket  Confirmed: Service Request 159722060 was successfully submitted. Submitted TSR report to Dell
[18:13:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:13:29] <stashbot>	 T326431: Some system users have invalid 'actor' database rows - https://phabricator.wikimedia.org/T326431
[18:16:04] <icinga-wm>	 PROBLEM - Host mw1486 is DOWN: PING CRITICAL - Packet loss = 100%
[18:17:53] <sukhe>	 ^ seems to be expected from SAL?
[18:18:28] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "deployed on runner-1026.the docker commandline has been changed there accordingly" [puppet] - 10https://gerrit.wikimedia.org/r/876184 (https://phabricator.wikimedia.org/T326378) (owner: 10Jelto)
[18:18:43] <zabe>	 T326425, I guess
[18:18:43] <stashbot>	 T326425: hw troubleshooting: power consumption reboot failure for mw1486.eqiad.wmnet - https://phabricator.wikimedia.org/T326425
[18:19:31] <RhinosF1>	 sukhe: maybe worth a downtime if it’s out of service
[18:19:36] <sukhe>	 yeah
[18:19:46] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: hw troubleshooting: power consumption reboot failure for mw1486.eqiad.wmnet - https://phabricator.wikimedia.org/T326425 (10Dzahn) 18:16 <+icinga-wm> PROBLEM - Host mw1486 is DOWN: PING CRITICAL - Packet loss = 100%
[18:20:35] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 4 days, 0:00:00 on mw1486.eqiad.wmnet with reason: downtimed, hw failure: T326425
[18:20:50] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on mw1486.eqiad.wmnet with reason: downtimed, hw failure: T326425
[18:21:28] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:21:38] <wikibugs>	 SRE, ops-eqiad, DC-Ops, serviceops: hw troubleshooting: power consumption reboot failure for mw1486.eqiad.wmnet - https://phabricator.wikimedia.org/T326425 (Dzahn) Open→In progress
[18:21:41] <RhinosF1>	 Thanks sukhe
[18:21:48] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:22:51] <sukhe>	 thanks all
[18:23:38] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:24:58] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 9.557 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:25:04] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 20 Feb 2023 05:31:14 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:26:04] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49419 bytes in 0.056 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:33:40] <wikibugs>	 (03CR) 10Dzahn: "I was looking for the upstream man page / docs for the --volume-filter parameter but somehow it wasn't obvious where that is. not in my lo" [puppet] - 10https://gerrit.wikimedia.org/r/876240 (https://phabricator.wikimedia.org/T326378) (owner: 10Ahmon Dancy)
[18:34:07] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp5032.eqsin.wmnet,service=cdn
[18:34:08] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp5032.eqsin.wmnet,service=ats-be
[18:35:49] <wikibugs>	 (03CR) 10Dzahn: "https://docs.docker.com/search/?q=volume-filter" [puppet] - 10https://gerrit.wikimedia.org/r/876240 (https://phabricator.wikimedia.org/T326378) (owner: 10Ahmon Dancy)
[18:36:12] <sukhe>	 !log pool cp5032 [bullseye upgrade completed]: T325797
[18:36:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:36:15] <stashbot>	 T325797: oom killed varnish on cp4052 - https://phabricator.wikimedia.org/T325797
[18:36:42] <wikibugs>	 (CR) Ahmon Dancy: Make gitlab-runner cache volumes eligible for docker-gc (1 comment) [puppet] - https://gerrit.wikimedia.org/r/876240 (https://phabricator.wikimedia.org/T326378) (owner: Ahmon Dancy)
[18:39:53] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] "thanks! lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/876240 (https://phabricator.wikimedia.org/T326378) (owner: 10Ahmon Dancy)
[18:41:39] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] Make gitlab-runner cache volumes eligible for docker-gc [puppet] - 10https://gerrit.wikimedia.org/r/876240 (https://phabricator.wikimedia.org/T326378) (owner: 10Ahmon Dancy)
[18:42:20] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "confirmed docker::gc is only used in gitlab::runner profile" [puppet] - 10https://gerrit.wikimedia.org/r/876240 (https://phabricator.wikimedia.org/T326378) (owner: 10Ahmon Dancy)
[18:47:58] <icinga-wm>	 PROBLEM - Check systemd state on gitlab-runner1002 is CRITICAL: CRITICAL - degraded: The following units failed: docker-gc.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:48:00] <icinga-wm>	 PROBLEM - Check systemd state on gitlab-runner2004 is CRITICAL: CRITICAL - degraded: The following units failed: docker-gc.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:48:04] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "deployed. --volume-filter 'label:com.gitlab.gitlab-runner.type=cache' has been added to the docker commandline on gitlab-runner* hosts" [puppet] - 10https://gerrit.wikimedia.org/r/876240 (https://phabricator.wikimedia.org/T326378) (owner: 10Ahmon Dancy)
[18:48:04] <icinga-wm>	 PROBLEM - Check systemd state on gitlab-runner1004 is CRITICAL: CRITICAL - degraded: The following units failed: docker-gc.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:48:12] <icinga-wm>	 PROBLEM - Check systemd state on gitlab-runner2003 is CRITICAL: CRITICAL - degraded: The following units failed: docker-gc.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:48:19] <mutante>	 well, that would be me then
[18:48:21] <wikibugs>	 (03CR) 10Ahmon Dancy: "Thanks Daniel!" [puppet] - 10https://gerrit.wikimedia.org/r/876240 (https://phabricator.wikimedia.org/T326378) (owner: 10Ahmon Dancy)
[18:48:36] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "well..  PROBLEM - Check systemd state on gitlab-runner2003 is CRITICAL: CRITICAL - degraded: The following units failed: docker-gc.service" [puppet] - 10https://gerrit.wikimedia.org/r/876240 (https://phabricator.wikimedia.org/T326378) (owner: 10Ahmon Dancy)
[18:49:03] <dancy>	 I'm around for debugging
[18:49:09] <mutante>	 dancy: Jan 06 18:48:34 gitlab-runner1002 docker[1218708]: Invalid filter: 'label:com.gitlab.gitlab-runner.type>
[18:49:10] <icinga-wm>	 PROBLEM - Check systemd state on gitlab-runner1003 is CRITICAL: CRITICAL - degraded: The following units failed: docker-gc.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:49:12] <icinga-wm>	 PROBLEM - Check systemd state on gitlab-runner2002 is CRITICAL: CRITICAL - degraded: The following units failed: docker-gc.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:49:37] <logmsgbot>	 !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 0:30:00 on 6 hosts with reason: debugging
[18:49:41] <dancy>	 hmm.. I'll look into that.  Feel free to revert in the meantime.
[18:49:46] <icinga-wm>	 RECOVERY - Check systemd state on gitlab-runner2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:49:49] <sukhe>	 Jan 06 18:44:40 gitlab-runner2002 docker[1216220]: Invalid filter: 'label:com.gitlab.gitlab-runner.type=cache'
[18:49:51] <mutante>	 dancy: 30 min downtime :)
[18:49:59] <logmsgbot>	 !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 6 hosts with reason: debugging
[18:50:16] <mutante>	 sukhe: thanks! yea, we just added that filter thing to the commandline
[18:50:21] <mutante>	 something about the format
[18:50:39] <sukhe>	 mutante: let us know if you need an extra pair of eyes (on-call person :)
[18:51:18] <dancy>	 ok. needs to be ==, not =
[18:51:28] <dancy>	 fixing...
[18:51:34] <mutante>	 sukhe: thank you! I don't see any production services being affected. it's just that garbage collection won't run. and it's downtimed and will be fixed very soon :)
[18:51:42] <sukhe>	 ok!
[18:52:10] <sukhe>	 mutante: had a clean week so just being careful :P
[18:52:35] <mutante>	 sukhe: appreciate the reaction to icinga alerts :))
[18:52:58] <wikibugs>	 (03PS1) 10Ahmon Dancy: Followup 3221b39a5736dd99befd5b72618c7c854dfb5252 [puppet] - 10https://gerrit.wikimedia.org/r/876247
[18:53:07] <mutante>	 also it was nice that downtime cookbook takes wildcard in hostname
[18:54:17] <wikibugs>	 (03PS2) 10Dzahn: docker::gc: fix syntax for volume-filter [puppet] - 10https://gerrit.wikimedia.org/r/876247 (owner: 10Ahmon Dancy)
[18:54:36] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] docker::gc: fix syntax for volume-filter [puppet] - 10https://gerrit.wikimedia.org/r/876247 (owner: 10Ahmon Dancy)
[18:54:57] <wikibugs>	 (03CR) 10Dzahn: [V: 03+2 C: 03+2] docker::gc: fix syntax for volume-filter [puppet] - 10https://gerrit.wikimedia.org/r/876247 (owner: 10Ahmon Dancy)
[18:56:53] <mutante>	 !log gitlab-runner1002 - systemctl start docker-gc; run puppet on all gitlab-runners T310593
[18:56:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:56:57] <stashbot>	 T310593: Experiencing pipeline failure due to disk-space issues - https://phabricator.wikimedia.org/T310593
[18:57:02] <icinga-wm>	 RECOVERY - Check systemd state on gitlab-runner1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:57:04] <icinga-wm>	 RECOVERY - Check systemd state on gitlab-runner2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:57:24] <icinga-wm>	 RECOVERY - Check systemd state on gitlab-runner1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:57:28] <icinga-wm>	 RECOVERY - Check systemd state on gitlab-runner2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:57:30] <icinga-wm>	 RECOVERY - Check systemd state on gitlab-runner1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:57:42] <mutante>	 !log systemctl start docker-gc on all gitlab-runners via cumin T310593
[18:57:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:57:48] <mutante>	 dancy: all good
[18:57:57] <mutante>	 sukhe: fixed
[18:58:07] <dancy>	 Great.  Sorry for the noise
[18:58:15] <mutante>	 no problem at all
[18:58:34] <sukhe>	 thanks all <3
[18:59:13] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "fixed by https://gerrit.wikimedia.org/r/c/operations/puppet/+/876247" [puppet] - 10https://gerrit.wikimedia.org/r/876240 (https://phabricator.wikimedia.org/T326378) (owner: 10Ahmon Dancy)
[19:00:38] <wikibugs>	 (03CR) 10Raymond Ndibe: [C: 03+2] tools-webservice: read buildservice_repository from webservice.yaml config file [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/867910 (https://phabricator.wikimedia.org/T323689) (owner: 10Raymond Ndibe)
[19:02:18] <wikibugs>	 (03Merged) 10jenkins-bot: tools-webservice: read buildservice_repository from webservice.yaml config file [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/867910 (https://phabricator.wikimedia.org/T323689) (owner: 10Raymond Ndibe)
[19:05:32] <wikibugs>	 (03CR) 10Dzahn: ": parameter 'id' expects a Systemd::Sysuser::Id = Variant[Integer[0], Enum['-'], Stdlib::Unixpath = Pattern[/\A\/([^\n\/\0]+\/*)*\z/], Pat" [puppet] - 10https://gerrit.wikimedia.org/r/865207 (owner: 10Dzahn)
[19:13:42] <icinga-wm>	 PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[19:15:18] <icinga-wm>	 RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST
[19:16:07] <wikibugs>	 (03PS1) 10Southparkfan: rsyslog: allow subject name validation [puppet] - 10https://gerrit.wikimedia.org/r/876248 (https://phabricator.wikimedia.org/T127717)
[19:19:45] <wikibugs>	 (03PS5) 10Dzahn: phabricator: use systemd::sysuser to create vcs user [puppet] - 10https://gerrit.wikimedia.org/r/865207
[19:21:38] <icinga-wm>	 PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[19:23:12] <icinga-wm>	 RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST
[19:24:02] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 101 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[19:25:01] * sukhe sings a song to bring it down to lt 100
[19:25:38] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 41 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[19:27:23] <mutante>	 fixed by (wiki whisperer) sukhe
[19:28:12] <wikibugs>	 (03CR) 10Dzahn: [C: 04-1] "parameter 'additional_groups' expects an Array value MEEEP" [puppet] - 10https://gerrit.wikimedia.org/r/865207 (owner: 10Dzahn)
[19:28:44] <wikibugs>	 (03PS6) 10Dzahn: phabricator: use systemd::sysuser to create vcs user [puppet] - 10https://gerrit.wikimedia.org/r/865207
[19:30:20] <sukhe>	 :D
[19:34:37] <wikibugs>	 (03PS1) 10Ottomata: Update flink-kubernetes-operator chart with upstream changes for 1.3.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/876249 (https://phabricator.wikimedia.org/T316519)
[19:43:39] <wikibugs>	 (03PS2) 10Ottomata: Update flink-kubernetes-operator chart with upstream changes for 1.3.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/876249 (https://phabricator.wikimedia.org/T316519)
[19:45:44] <wikibugs>	 (03PS1) 10Southparkfan: profile::base: fix hiera key name fox tls_client_auth [puppet] - 10https://gerrit.wikimedia.org/r/876251 (https://phabricator.wikimedia.org/T127717)
[19:46:25] <wikibugs>	 (03PS2) 10Southparkfan: profile::base: fix hiera key name for tls_client_auth [puppet] - 10https://gerrit.wikimedia.org/r/876251 (https://phabricator.wikimedia.org/T127717)
[19:47:11] <wikibugs>	 (CR) Southparkfan: rsyslog: allow subject name validation (1 comment) [puppet] - https://gerrit.wikimedia.org/r/876248 (https://phabricator.wikimedia.org/T127717) (owner: Southparkfan)
[19:47:56] <wikibugs>	 (CR) Southparkfan: rsyslog: allow subject name validation (1 comment) [puppet] - https://gerrit.wikimedia.org/r/876248 (https://phabricator.wikimedia.org/T127717) (owner: Southparkfan)
[19:50:37] <wikibugs>	 (03PS16) 10Ottomata: flink and flink-kubernetes-operator image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/858356 (https://phabricator.wikimedia.org/T316519)
[19:51:08] <wikibugs>	 (CR) Ottomata: flink and flink-kubernetes-operator image (6 comments) [docker-images/production-images] - https://gerrit.wikimedia.org/r/858356 (https://phabricator.wikimedia.org/T316519) (owner: Ottomata)
[19:51:50] <wikibugs>	 (CR) Ottomata: flink and flink-kubernetes-operator image (1 comment) [docker-images/production-images] - https://gerrit.wikimedia.org/r/858356 (https://phabricator.wikimedia.org/T316519) (owner: Ottomata)
[19:57:38] <icinga-wm>	 PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[19:59:10] <icinga-wm>	 RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST
[20:05:42] <wikibugs>	 (03PS27) 10Ottomata: flink-app chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/866510 (https://phabricator.wikimedia.org/T324576)
[20:10:03] <jinxer-wm>	 (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:13:36] <wikibugs>	 (CR) Ottomata: flink-app chart (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/866510 (https://phabricator.wikimedia.org/T324576) (owner: Ottomata)
[20:15:03] <jinxer-wm>	 (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:26:01] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1006-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[21:52:28] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "well, now we have a range up to 499 and one starting at 1000 but the first one I wanted to use it with, phd, is 920." [puppet] - 10https://gerrit.wikimedia.org/r/875446 (owner: 10Dzahn)
[23:03:44] <icinga-wm>	 RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:15:36] <wikibugs>	 (03PS9) 10Krinkle: Relax CSP rules for taint-check-demo [puppet] - 10https://gerrit.wikimedia.org/r/680337 (https://phabricator.wikimedia.org/T257301) (owner: 10Daimona Eaytoy)
[23:19:58] <wikibugs>	 (03PS10) 10Krinkle: doc: Relax CSP rules for taint-check-demo [puppet] - 10https://gerrit.wikimedia.org/r/680337 (https://phabricator.wikimedia.org/T257301) (owner: 10Daimona Eaytoy)
[23:21:48] <wikibugs>	 (03PS11) 10Krinkle: doc: Relax CSP rules for taint-check-demo [puppet] - 10https://gerrit.wikimedia.org/r/680337 (https://phabricator.wikimedia.org/T257301) (owner: 10Daimona Eaytoy)
[23:21:57] <wikibugs>	 (03CR) 10Krinkle: [C: 03+1] doc: Relax CSP rules for taint-check-demo [puppet] - 10https://gerrit.wikimedia.org/r/680337 (https://phabricator.wikimedia.org/T257301) (owner: 10Daimona Eaytoy)
[23:23:50] <wikibugs>	 (03CR) 10Krinkle: [C: 03+1] "Scheduled for https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230110T1700" [puppet] - 10https://gerrit.wikimedia.org/r/680337 (https://phabricator.wikimedia.org/T257301) (owner: 10Daimona Eaytoy)
[23:23:55] <wikibugs>	 (03CR) 10Krinkle: "Scheduled for https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230110T1700" [puppet] - 10https://gerrit.wikimedia.org/r/817409 (https://phabricator.wikimedia.org/T313881) (owner: 10Krinkle)
[23:29:42] <wikibugs>	 (CR) Krinkle: [C: +1] doc: Relax CSP rules for taint-check-demo (1 comment) [puppet] - https://gerrit.wikimedia.org/r/680337 (https://phabricator.wikimedia.org/T257301) (owner: Daimona Eaytoy)