[00:25:01] <jinxer-wm>	 (NodeTextfileStale) firing: (3) Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[00:31:04] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:39:26] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/919382
[00:39:30] <wikibugs>	 (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/919382 (owner: TrainBranchBot)
[00:59:22] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/919382 (owner: TrainBranchBot)
[01:03:34] <wikibugs>	 ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T336720 (phaultfinder)
[02:00:05] <jouncebot>	 Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230516T0200)
[02:06:32] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:08:01] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/1.41.0-wmf.9 [core] (wmf/1.41.0-wmf.9) - https://gerrit.wikimedia.org/r/919383 (https://phabricator.wikimedia.org/T330215)
[02:08:05] <wikibugs>	 (CR) TrainBranchBot: [C: +2] Branch commit for wmf/1.41.0-wmf.9 [core] (wmf/1.41.0-wmf.9) - https://gerrit.wikimedia.org/r/919383 (https://phabricator.wikimedia.org/T330215) (owner: TrainBranchBot)
[02:11:32] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:13:01] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for bast2003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[02:21:32] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:24:02] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/1.41.0-wmf.9 [core] (wmf/1.41.0-wmf.9) - https://gerrit.wikimedia.org/r/919383 (https://phabricator.wikimedia.org/T330215) (owner: TrainBranchBot)
[02:26:32] <jinxer-wm>	 (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:56:55] <wikibugs>	 (CR) TChin: Add flink-app default log config and use it in page_content_change (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/917999 (https://phabricator.wikimedia.org/T335802) (owner: TChin)
[03:00:05] <jouncebot>	 Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230516T0300)
[03:01:08] <icinga-wm>	 RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:01:30] <wikibugs>	 (PS1) TrainBranchBot: testwikis wikis to 1.41.0-wmf.9 [mediawiki-config] - https://gerrit.wikimedia.org/r/919934 (https://phabricator.wikimedia.org/T330215)
[03:01:32] <wikibugs>	 (CR) TrainBranchBot: [C: +2] testwikis wikis to 1.41.0-wmf.9 [mediawiki-config] - https://gerrit.wikimedia.org/r/919934 (https://phabricator.wikimedia.org/T330215) (owner: TrainBranchBot)
[03:02:14] <wikibugs>	 (Merged) jenkins-bot: testwikis wikis to 1.41.0-wmf.9 [mediawiki-config] - https://gerrit.wikimedia.org/r/919934 (https://phabricator.wikimedia.org/T330215) (owner: TrainBranchBot)
[03:02:46] <logmsgbot>	 !log mwpresync@deploy1002 Started scap: testwikis wikis to 1.41.0-wmf.9  refs T330215
[03:02:51] <stashbot>	 T330215: 1.41.0-wmf.9 deployment blockers - https://phabricator.wikimedia.org/T330215
[03:10:08] <icinga-wm>	 RECOVERY - Check systemd state on an-airflow1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:26:14] <jinxer-wm>	 (SystemdUnitFailed) firing: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-categories.service Failed on wdqs1010:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:51:34] <logmsgbot>	 !log mwpresync@deploy1002 Finished scap: testwikis wikis to 1.41.0-wmf.9  refs T330215 (duration: 48m 47s)
[03:51:39] <stashbot>	 T330215: 1.41.0-wmf.9 deployment blockers - https://phabricator.wikimedia.org/T330215
[03:54:03] <logmsgbot>	 !log mwpresync@deploy1002 Pruned MediaWiki: 1.41.0-wmf.6, 1.41.0-wmf.7 (duration: 02m 26s)
[04:15:01] <jinxer-wm>	 (NodeTextfileStale) firing: (3) Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[04:18:04] <icinga-wm>	 RECOVERY - dump of backup1-codfw in codfw on backupmon1001 is OK: Last dump for backup1-codfw at codfw (db2184) taken on 2023-05-16 03:53:29 (15 GiB, +0.2 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[05:08:55] <wikibugs>	 SRE, ops-eqiad, decommission-hardware: decommission db1123.eqiad.wmnet - https://phabricator.wikimedia.org/T334910 (Marostegui)
[05:10:20] <wikibugs>	 SRE, ops-eqiad, DBA, DC-Ops, Data-Persistence-Backup: db1225 crashed (CPU 1 machine check error detected) - https://phabricator.wikimedia.org/T336326 (Marostegui) Thanks @Jclark-ctr  @jcrespo can you take care of putting this host back in service as it is a backup source one?
[05:20:14] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1121 T336337', diff saved to https://phabricator.wikimedia.org/P48236 and previous config saved to /var/cache/conftool/dbconfig/20230516-052014-root.json
[05:20:19] <stashbot>	 T336337: Failover s4 sanitarium master - https://phabricator.wikimedia.org/T336337
[05:20:26] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1221 T336337', diff saved to https://phabricator.wikimedia.org/P48237 and previous config saved to /var/cache/conftool/dbconfig/20230516-052026-root.json
[05:24:29] <wikibugs>	 (PS1) Marostegui: db1121: No longer sanitarium master [puppet] - https://gerrit.wikimedia.org/r/920135 (https://phabricator.wikimedia.org/T336337)
[05:25:42] <wikibugs>	 (CR) Marostegui: [C: +2] db1121: No longer sanitarium master [puppet] - https://gerrit.wikimedia.org/r/920135 (https://phabricator.wikimedia.org/T336337) (owner: Marostegui)
[05:27:38] <wikibugs>	 (PS1) Marostegui: site.pp: Update sanitarium master for s4 [puppet] - https://gerrit.wikimedia.org/r/920137 (https://phabricator.wikimedia.org/T336337)
[05:28:57] <wikibugs>	 (CR) Marostegui: [C: +2] site.pp: Update sanitarium master for s4 [puppet] - https://gerrit.wikimedia.org/r/920137 (https://phabricator.wikimedia.org/T336337) (owner: Marostegui)
[05:29:20] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1121 (re)pooling @ 1%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48238 and previous config saved to /var/cache/conftool/dbconfig/20230516-052920-root.json
[05:29:37] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1221 (re)pooling @ 1%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48239 and previous config saved to /var/cache/conftool/dbconfig/20230516-052936-root.json
[05:30:24] <wikibugs>	 SRE, ops-eqiad, decommission-hardware: decommission db1113.eqiad.wmnet - https://phabricator.wikimedia.org/T336029 (Marostegui)
[05:32:12] <wikibugs>	 (PS1) Marostegui: ProductionServices.php: Failover pc3 codfw host [mediawiki-config] - https://gerrit.wikimedia.org/r/920139
[05:33:09] <wikibugs>	 (CR) Marostegui: [C: +2] ProductionServices.php: Failover pc3 codfw host [mediawiki-config] - https://gerrit.wikimedia.org/r/920139 (owner: Marostegui)
[05:33:11] <wikibugs>	 (PS1) Marostegui: pc2014: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/920140
[05:33:54] <wikibugs>	 (Merged) jenkins-bot: ProductionServices.php: Failover pc3 codfw host [mediawiki-config] - https://gerrit.wikimedia.org/r/920139 (owner: Marostegui)
[05:35:32] <wikibugs>	 (CR) Marostegui: [C: +2] pc2014: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/920140 (owner: Marostegui)
[05:36:32] <logmsgbot>	 !log marostegui@deploy1002 Started scap: Backport for [[gerrit:920139|ProductionServices.php: Failover pc3 codfw host]]
[05:38:05] <logmsgbot>	 !log marostegui@deploy1002 marostegui: Backport for [[gerrit:920139|ProductionServices.php: Failover pc3 codfw host]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet
[05:43:47] <logmsgbot>	 !log marostegui@deploy1002 Finished scap: Backport for [[gerrit:920139|ProductionServices.php: Failover pc3 codfw host]] (duration: 07m 15s)
[05:44:25] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1121 (re)pooling @ 3%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48240 and previous config saved to /var/cache/conftool/dbconfig/20230516-054425-root.json
[05:44:42] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1221 (re)pooling @ 3%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48241 and previous config saved to /var/cache/conftool/dbconfig/20230516-054441-root.json
[05:44:51] <wikibugs>	 (PS1) Marostegui: Revert "pc2014: Enable notifications" [puppet] - https://gerrit.wikimedia.org/r/919322
[05:45:04] <wikibugs>	 (PS1) Marostegui: Revert "ProductionServices.php: Failover pc3 codfw host" [mediawiki-config] - https://gerrit.wikimedia.org/r/919323
[05:51:23] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1112 T336332', diff saved to https://phabricator.wikimedia.org/P48242 and previous config saved to /var/cache/conftool/dbconfig/20230516-055122-root.json
[05:51:28] <stashbot>	 T336332: decommission db1112.eqiad.wmnet - https://phabricator.wikimedia.org/T336332
[05:52:32] <wikibugs>	 (PS1) Marostegui: db1112: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/920142 (https://phabricator.wikimedia.org/T336332)
[05:53:03] <wikibugs>	 (CR) Marostegui: [C: +2] db1112: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/920142 (https://phabricator.wikimedia.org/T336332) (owner: Marostegui)
[05:53:30] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "pc2014: Enable notifications" [puppet] - https://gerrit.wikimedia.org/r/919322 (owner: Marostegui)
[05:53:41] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "ProductionServices.php: Failover pc3 codfw host" [mediawiki-config] - https://gerrit.wikimedia.org/r/919323 (owner: Marostegui)
[05:54:29] <wikibugs>	 (Merged) jenkins-bot: Revert "ProductionServices.php: Failover pc3 codfw host" [mediawiki-config] - https://gerrit.wikimedia.org/r/919323 (owner: Marostegui)
[05:58:07] <logmsgbot>	 !log marostegui@deploy1002 Started scap: Backport for [[gerrit:919323|Revert "ProductionServices.php: Failover pc3 codfw host"]]
[05:59:30] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1121 (re)pooling @ 5%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48243 and previous config saved to /var/cache/conftool/dbconfig/20230516-055929-root.json
[05:59:36] <logmsgbot>	 !log marostegui@deploy1002 marostegui: Backport for [[gerrit:919323|Revert "ProductionServices.php: Failover pc3 codfw host"]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet
[05:59:46] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1221 (re)pooling @ 5%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48244 and previous config saved to /var/cache/conftool/dbconfig/20230516-055946-root.json
[06:00:06] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230516T0600)
[06:00:06] <jouncebot>	 kormat, marostegui, and Amir1: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230516T0600).
[06:02:15] <wikibugs>	 (PS1) Marostegui: pc2014: Move it to pc1 [puppet] - https://gerrit.wikimedia.org/r/920143
[06:05:28] <logmsgbot>	 !log marostegui@deploy1002 Finished scap: Backport for [[gerrit:919323|Revert "ProductionServices.php: Failover pc3 codfw host"]] (duration: 07m 21s)
[06:09:06] <wikibugs>	 (CR) Marostegui: [C: +2] pc2014: Move it to pc1 [puppet] - https://gerrit.wikimedia.org/r/920143 (owner: Marostegui)
[06:12:34] <wikibugs>	 (PS1) Marostegui: pc1014: Make it pc3 master [puppet] - https://gerrit.wikimedia.org/r/920146
[06:13:01] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for bast2003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[06:13:34] <wikibugs>	 (PS1) Marostegui: ProductionServices.php: Promote pc1014 to pc3 master [mediawiki-config] - https://gerrit.wikimedia.org/r/920147
[06:14:02] <wikibugs>	 (CR) Marostegui: [C: +2] pc1014: Make it pc3 master [puppet] - https://gerrit.wikimedia.org/r/920146 (owner: Marostegui)
[06:14:17] <wikibugs>	 (CR) Marostegui: [C: +2] ProductionServices.php: Promote pc1014 to pc3 master [mediawiki-config] - https://gerrit.wikimedia.org/r/920147 (owner: Marostegui)
[06:14:35] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1121 (re)pooling @ 10%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48245 and previous config saved to /var/cache/conftool/dbconfig/20230516-061434-root.json
[06:14:51] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1221 (re)pooling @ 10%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48246 and previous config saved to /var/cache/conftool/dbconfig/20230516-061450-root.json
[06:15:04] <wikibugs>	 (Merged) jenkins-bot: ProductionServices.php: Promote pc1014 to pc3 master [mediawiki-config] - https://gerrit.wikimedia.org/r/920147 (owner: Marostegui)
[06:17:26] <logmsgbot>	 !log marostegui@deploy1002 Started scap: Backport for [[gerrit:920147|ProductionServices.php: Promote pc1014 to pc3 master]]
[06:18:51] <logmsgbot>	 !log marostegui@deploy1002 marostegui: Backport for [[gerrit:920147|ProductionServices.php: Promote pc1014 to pc3 master]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet
[06:24:34] <logmsgbot>	 !log marostegui@deploy1002 Finished scap: Backport for [[gerrit:920147|ProductionServices.php: Promote pc1014 to pc3 master]] (duration: 07m 08s)
[06:25:34] <jinxer-wm>	 (KubernetesAPILatency) firing: (2) High Kubernetes API latency (PUT deployments) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[06:29:39] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1121 (re)pooling @ 25%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48247 and previous config saved to /var/cache/conftool/dbconfig/20230516-062939-root.json
[06:29:56] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1221 (re)pooling @ 25%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48248 and previous config saved to /var/cache/conftool/dbconfig/20230516-062955-root.json
[06:30:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (PUT deployments) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[06:31:57] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] mediawiki: Add auto_prepend_file to PHP config_cli (take 2) [puppet] - https://gerrit.wikimedia.org/r/910882 (https://phabricator.wikimedia.org/T253547) (owner: Krinkle)
[06:33:31] <wikibugs>	 (PS1) Marostegui: Revert "ProductionServices.php: Promote pc1014 to pc3 master" [mediawiki-config] - https://gerrit.wikimedia.org/r/919324
[06:33:46] <wikibugs>	 (PS1) Marostegui: Revert "pc1014: Make it pc3 master" [puppet] - https://gerrit.wikimedia.org/r/919325
[06:40:33] <wikibugs>	 (CR) Slyngshede: [C: +2] sre.hosts.reimage: merge reimage cookbooks [cookbooks] - https://gerrit.wikimedia.org/r/904510 (owner: Slyngshede)
[06:44:44] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1121 (re)pooling @ 50%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48249 and previous config saved to /var/cache/conftool/dbconfig/20230516-064444-root.json
[06:45:01] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1221 (re)pooling @ 50%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48250 and previous config saved to /var/cache/conftool/dbconfig/20230516-064500-root.json
[06:46:46] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] jobrunner: reduce max_requests_per_connection to 100 [puppet] - https://gerrit.wikimedia.org/r/919262 (https://phabricator.wikimedia.org/T336554) (owner: Giuseppe Lavagetto)
[06:47:00] <icinga-wm>	 PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The following units failed: mediawiki_job_purge_parsercache_pc3.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:48:41] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "ProductionServices.php: Promote pc1014 to pc3 master" [mediawiki-config] - https://gerrit.wikimedia.org/r/919324 (owner: Marostegui)
[06:49:07] <_joe_>	 !log running docker image prune -a in build2001
[06:49:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:49:24] <wikibugs>	 (Merged) jenkins-bot: Revert "ProductionServices.php: Promote pc1014 to pc3 master" [mediawiki-config] - https://gerrit.wikimedia.org/r/919324 (owner: Marostegui)
[06:49:54] <logmsgbot>	 !log marostegui@deploy1002 Started scap: Backport for [[gerrit:919324|Revert "ProductionServices.php: Promote pc1014 to pc3 master"]]
[06:51:26] <logmsgbot>	 !log marostegui@deploy1002 marostegui: Backport for [[gerrit:919324|Revert "ProductionServices.php: Promote pc1014 to pc3 master"]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet
[06:52:01] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "pc1014: Make it pc3 master" [puppet] - https://gerrit.wikimedia.org/r/919325 (owner: Marostegui)
[06:56:52] <logmsgbot>	 !log marostegui@deploy1002 Finished scap: Backport for [[gerrit:919324|Revert "ProductionServices.php: Promote pc1014 to pc3 master"]] (duration: 06m 58s)
[06:57:24] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' .
[06:59:49] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1121 (re)pooling @ 75%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48251 and previous config saved to /var/cache/conftool/dbconfig/20230516-065948-root.json
[07:00:05] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1221 (re)pooling @ 75%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48252 and previous config saved to /var/cache/conftool/dbconfig/20230516-070005-root.json
[07:00:06] <jouncebot>	 Amir1, Urbanecm, and taavi: gettimeofday() says it's time for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230516T0700)
[07:00:06] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[07:07:20] <wikibugs>	 (PS1) Slyngshede: k8s upgrade cluster: use sre.hosts.reimage [cookbooks] - https://gerrit.wikimedia.org/r/920192 (https://phabricator.wikimedia.org/T336491)
[07:09:53] <wikibugs>	 (CR) CI reject: [V: -1] k8s upgrade cluster: use sre.hosts.reimage [cookbooks] - https://gerrit.wikimedia.org/r/920192 (https://phabricator.wikimedia.org/T336491) (owner: Slyngshede)
[07:14:54] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1121 (re)pooling @ 100%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48253 and previous config saved to /var/cache/conftool/dbconfig/20230516-071453-root.json
[07:15:10] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1221 (re)pooling @ 100%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48254 and previous config saved to /var/cache/conftool/dbconfig/20230516-071509-root.json
[07:15:35] <wikibugs>	 (PS2) Slyngshede: k8s upgrade cluster: use sre.hosts.reimage [cookbooks] - https://gerrit.wikimedia.org/r/920192 (https://phabricator.wikimedia.org/T336491)
[07:16:06] <wikibugs>	 (CR) Muehlenhoff: Obsolete profile::python37 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/917813 (owner: Muehlenhoff)
[07:16:30] <wikibugs>	 (PS1) Marostegui: install_server: Do not reimage db1220 [puppet] - https://gerrit.wikimedia.org/r/920193
[07:26:14] <jinxer-wm>	 (SystemdUnitFailed) firing: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-categories.service Failed on wdqs1010:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:28:51] <Emperor>	 !log restart vopsbot.service on alert1001
[07:28:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:31:28] <wikibugs>	 (PS1) Marostegui: production-m5.sql: Add ipoid grants [puppet] - https://gerrit.wikimedia.org/r/920194 (https://phabricator.wikimedia.org/T305114)
[07:31:30] <wikibugs>	 SRE, ops-eqiad, DBA, DC-Ops, Data-Persistence-Backup: db1225 crashed (CPU 1 machine check error detected) - https://phabricator.wikimedia.org/T336326 (jcrespo) I was working on it already :-D, was going to notify when completed, as it has 3 sections and I have so far only loaded back 2.
[07:31:35] <wikibugs>	 (CR) Marostegui: [C: +2] install_server: Do not reimage db1220 [puppet] - https://gerrit.wikimedia.org/r/920193 (owner: Marostegui)
[07:33:07] <wikibugs>	 (PS1) Giuseppe Lavagetto: docker::baseimages: skip building bookworm [puppet] - https://gerrit.wikimedia.org/r/920196 (https://phabricator.wikimedia.org/T335560)
[07:34:53] <icinga-wm>	 PROBLEM - BGP status on cr3-knams is CRITICAL: BGP CRITICAL - AS1257/IPv6: Connect - Tele2, AS1257/IPv4: Active - Tele2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:38:46] <wikibugs>	 SRE, ops-eqiad, DBA, DC-Ops, Data-Persistence-Backup: db1225 crashed (CPU 1 machine check error detected) - https://phabricator.wikimedia.org/T336326 (Marostegui) \o/
[07:40:51] <wikibugs>	 (CR) Jelto: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41194/console" [puppet] - https://gerrit.wikimedia.org/r/918427 (https://phabricator.wikimedia.org/T316935) (owner: Jelto)
[07:42:16] <wikibugs>	 (PS2) Giuseppe Lavagetto: docker::baseimages: skip building bookworm [puppet] - https://gerrit.wikimedia.org/r/920196 (https://phabricator.wikimedia.org/T335560)
[07:43:41] <wikibugs>	 (CR) Ayounsi: [C: +1] sites.yaml: add new dns host dns2005 (codfw hardware refresh) [homer/public] - https://gerrit.wikimedia.org/r/919876 (https://phabricator.wikimedia.org/T326688) (owner: Ssingh)
[07:44:17] <wikibugs>	 (CR) Giuseppe Lavagetto: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41195/console" [puppet] - https://gerrit.wikimedia.org/r/920196 (https://phabricator.wikimedia.org/T335560) (owner: Giuseppe Lavagetto)
[07:44:50] <wikibugs>	 (CR) Jelto: [V: +1 C: +2] gitlab: run backup sync and restore twice daily [puppet] - https://gerrit.wikimedia.org/r/918427 (https://phabricator.wikimedia.org/T316935) (owner: Jelto)
[07:45:17] <wikibugs>	 (CR) Giuseppe Lavagetto: [V: +1 C: +2] docker::baseimages: skip building bookworm [puppet] - https://gerrit.wikimedia.org/r/920196 (https://phabricator.wikimedia.org/T335560) (owner: Giuseppe Lavagetto)
[07:52:15] <logmsgbot>	 !log jelto@cumin1001 START - Cookbook sre.gitlab.reboot-runner rolling reboot on A:gitlab-runner
[07:54:02] <wikibugs>	 (CR) David Caro: [C: +1] "LGTM" [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/919920 (https://phabricator.wikimedia.org/T320904) (owner: BryanDavis)
[07:58:01] <jinxer-wm>	 (NodeTextfileStale) resolved: Stale textfile for bast2003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[07:59:09] <wikibugs>	 SRE-swift-storage: Thanos root filesystem filling with logs - https://phabricator.wikimedia.org/T329712 (MatthewVernon) Open→Resolved [this was resolved back in February - we moved the two thanos backends back into service and added on delaycompress]
[08:12:08] <wikibugs>	 (PS1) Filippo Giunchedi: sre: disable pint promql/series check for SystemdUnitFailed [alerts] - https://gerrit.wikimedia.org/r/920199 (https://phabricator.wikimedia.org/T309182)
[08:12:17] <wikibugs>	 (CR) JMeybohm: [C: +1] New wikikube service: mediawiki-page-content-change-enrichment - staging [deployment-charts] - https://gerrit.wikimedia.org/r/895241 (https://phabricator.wikimedia.org/T325303) (owner: Ottomata)
[08:14:47] <wikibugs>	 SRE, DBA, Data-Platform-SRE, Infrastructure-Foundations, and 10 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (MoritzMuehlenhoff)
[08:15:01] <jinxer-wm>	 (NodeTextfileStale) firing: (2) Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[08:16:41] <wikibugs>	 SRE, DBA, Data-Platform-SRE, Infrastructure-Foundations, and 10 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (MoritzMuehlenhoff)
[08:17:25] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbproxy2003.codfw.wmnet with reason: Maintenance
[08:17:38] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbproxy2003.codfw.wmnet with reason: Maintenance
[08:17:51] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbproxy2004.codfw.wmnet with reason: Maintenance
[08:18:04] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbproxy2004.codfw.wmnet with reason: Maintenance
[08:18:09] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on pc2014.codfw.wmnet with reason: Maintenance
[08:18:22] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on pc2014.codfw.wmnet with reason: Maintenance
[08:18:27] <logmsgbot>	 !log jmm@puppetmaster1001 conftool action : set/pooled=no; selector: name=ldap-replica2006.wikimedia.org
[08:19:17] <wikibugs>	 SRE, DBA, Data-Platform-SRE, Infrastructure-Foundations, and 10 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (MoritzMuehlenhoff)
[08:21:47] <wikibugs>	 (CR) Jaime Nuche: "Thank you for the merge!" [labs/private] - https://gerrit.wikimedia.org/r/919833 (https://phabricator.wikimedia.org/T336168) (owner: Jaime Nuche)
[08:23:39] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 7:00:00 on es[2023-2025].codfw.wmnet with reason: maintenance
[08:23:43] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7:00:00 on es[2023-2025].codfw.wmnet with reason: maintenance
[08:24:13] <wikibugs>	 SRE, DBA, Data-Platform-SRE, Infrastructure-Foundations, and 10 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (MoritzMuehlenhoff)
[08:25:18] <wikibugs>	 (CR) Jaime Nuche: "Thanks a lot for the fix, merging and monitoring. I really appreciate all the effort." [puppet] - https://gerrit.wikimedia.org/r/919834 (https://phabricator.wikimedia.org/T336168) (owner: Jaime Nuche)
[08:26:09] <icinga-wm>	 PROBLEM - Router interfaces on cr3-knams is CRITICAL: CRITICAL: host 91.198.174.246, interfaces up: 72, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:27:20] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] sre: disable pint promql/series check for SystemdUnitFailed [alerts] - https://gerrit.wikimedia.org/r/920199 (https://phabricator.wikimedia.org/T309182) (owner: Filippo Giunchedi)
[08:28:27] <wikibugs>	 (CR) Vgutierrez: [V: +1 C: +2] hiera: Switch cp4052 to HAProxy 2.7 branch [puppet] - https://gerrit.wikimedia.org/r/919862 (https://phabricator.wikimedia.org/T317799) (owner: Vgutierrez)
[08:33:25] <wikibugs>	 SRE, Data-Engineering, Data-Platform-SRE, LDAP-Access-Requests, and 3 others: Grant temporary access to web based Data Engineering tools to Bishop Fox - https://phabricator.wikimedia.org/T336357 (BTullis) Thanks @Dzahn - That's a useful reference. I've created two user accounts in Matomo for `twi...
[08:33:29] <wikibugs>	 (PS1) Filippo Giunchedi: o11y: ignore promql/series for code/thanos-query-frontend [alerts] - https://gerrit.wikimedia.org/r/920201 (https://phabricator.wikimedia.org/T309182)
[08:33:51] <icinga-wm>	 RECOVERY - Router interfaces on cr3-knams is OK: OK: host 91.198.174.246, interfaces up: 73, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:35:23] <icinga-wm>	 PROBLEM - haproxy process on cp4052 is CRITICAL: PROCS CRITICAL: 0 processes with command name haproxy https://wikitech.wikimedia.org/wiki/HAProxy
[08:35:33] <icinga-wm>	 PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp4052 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS
[08:35:37] <icinga-wm>	 PROBLEM - Check systemd state on cp4052 is CRITICAL: CRITICAL - degraded: The following units failed: haproxy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:35:42] <vgutierrez>	 ^^ cp4052 is me and it's currently depooled
[08:36:50] <wikibugs>	 (PS1) Mvolz: Update Zotero to most recent version [deployment-charts] - https://gerrit.wikimedia.org/r/920202 (https://phabricator.wikimedia.org/T336727)
[08:36:57] <icinga-wm>	 RECOVERY - haproxy process on cp4052 is OK: PROCS OK: 2 processes with command name haproxy https://wikitech.wikimedia.org/wiki/HAProxy
[08:37:03] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] o11y: ignore promql/series for code/thanos-query-frontend [alerts] - https://gerrit.wikimedia.org/r/920201 (https://phabricator.wikimedia.org/T309182) (owner: Filippo Giunchedi)
[08:37:07] <icinga-wm>	 RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp4052 is OK: SSL OK - OCSP staple validity for wikipedia.org has 426172 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2023-07-23 06:25:44 +0000 (expires in 67 days) https://wikitech.wikimedia.org/wiki/HTTPS
[08:37:13] <icinga-wm>	 RECOVERY - Check systemd state on cp4052 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:38:37] <icinga-wm>	 PROBLEM - Host gitlab-runner1003 is DOWN: PING CRITICAL - Packet loss = 100%
[08:38:41] <wikibugs>	 (PS2) Jcrespo: Revert "db1225: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/918446
[08:39:08] <wikibugs>	 (CR) Jcrespo: "We are ready to get db1225 into production" [puppet] - https://gerrit.wikimedia.org/r/918446 (owner: Jcrespo)
[08:40:07] <wikibugs>	 SRE, Infrastructure-Foundations, User-MoritzMuehlenhoff, User-jbond: Investigate GID allocation for system users - https://phabricator.wikimedia.org/T235163 (MoritzMuehlenhoff)
[08:42:48] <wikibugs>	 SRE, Infrastructure-Foundations, User-Kormat: debdeploy skipped hosts and assumed they're up to date(?) - https://phabricator.wikimedia.org/T268735 (MoritzMuehlenhoff) Open→Declined Old task, no longer really actionable at this point and this hasn't been seen since then.
[08:43:00] <wikibugs>	 (CR) Jcrespo: [C: +2] Revert "db1225: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/918446 (owner: Jcrespo)
[08:43:33] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 7:00:00 on es2033.codfw.wmnet with reason: Maintenance
[08:43:47] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7:00:00 on es2033.codfw.wmnet with reason: Maintenance
[08:43:52] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 7:00:00 on es2034.codfw.wmnet with reason: Maintenance
[08:44:05] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7:00:00 on es2034.codfw.wmnet with reason: Maintenance
[08:46:47] <wikibugs>	 SRE, ops-eqiad, DBA, DC-Ops, Data-Persistence-Backup: db1225 crashed (CPU 1 machine check error detected) - https://phabricator.wikimedia.org/T336326 (jcrespo) All 3 sections loaded and replicating, I have reverted the notifications disabled patch. All done.
[08:49:20] <wikibugs>	 (PS1) Slyngshede: sre.ganeti.makevm call reimage after VM creation [cookbooks] - https://gerrit.wikimedia.org/r/920203 (https://phabricator.wikimedia.org/T336491)
[08:49:31] <wikibugs>	 (PS1) Muehlenhoff: Temporary drop krb1001 from KDC list used by clients [puppet] - https://gerrit.wikimedia.org/r/920204 (https://phabricator.wikimedia.org/T331695)
[08:49:45] <wikibugs>	 (PS1) Filippo Giunchedi: dcops: temp disable promql/series pint check for InterfaceSpeedError [alerts] - https://gerrit.wikimedia.org/r/920205 (https://phabricator.wikimedia.org/T309182)
[08:50:59] <wikibugs>	 SRE, Traffic, Wikidata, wdwb-tech, wmde-wikidata-tech: Wikidata seems to still be utilizing insecure HTTP URIs - https://phabricator.wikimedia.org/T331356 (OlafJanssen) >>! In T331356#8849873, @Ladsgroup wrote: > Until it gets changed to HTTPS, basically we have two options: >  - Remove the l...
[08:52:49] <wikibugs>	 (PS1) Effie Mouzeli: php-multiversion-base: add rsvg-convert [docker-images/production-images] - https://gerrit.wikimedia.org/r/920206 (https://phabricator.wikimedia.org/T336025)
[08:59:01] <wikibugs>	 (PS1) Vgutierrez: cache::haproxy: Set nbthreads on the first global section [puppet] - https://gerrit.wikimedia.org/r/920207 (https://phabricator.wikimedia.org/T317799)
[08:59:08] <wikibugs>	 (PS3) JMeybohm: Update charts from mesh.configuration 1.2.0 to 1.2.1 [deployment-charts] - https://gerrit.wikimedia.org/r/919848 (https://phabricator.wikimedia.org/T300324)
[08:59:10] <wikibugs>	 (PS4) JMeybohm: Update charts from mesh.configuration 1.1 to 1.2 [deployment-charts] - https://gerrit.wikimedia.org/r/919849 (https://phabricator.wikimedia.org/T300324)
[09:01:05] <wikibugs>	 (PS1) Klausman: admin_ng: add revertrisk model config to ml-serve clusters [deployment-charts] - https://gerrit.wikimedia.org/r/920208
[09:01:51] <wikibugs>	 (CR) CI reject: [V: -1] admin_ng: add revertrisk model config to ml-serve clusters [deployment-charts] - https://gerrit.wikimedia.org/r/920208 (owner: Klausman)
[09:04:40] <wikibugs>	 (CR) Effie Mouzeli: [V: +2 C: +2] php-multiversion-base: add rsvg-convert [docker-images/production-images] - https://gerrit.wikimedia.org/r/920206 (https://phabricator.wikimedia.org/T336025) (owner: Effie Mouzeli)
[09:06:44] <wikibugs>	 (PS5) Samtar: InitialiseSettings: Set wgWatchersMaxAge=30days [mediawiki-config] - https://gerrit.wikimedia.org/r/919023 (https://phabricator.wikimedia.org/T336250) (owner: Sarah Mukuti)
[09:08:37] <wikibugs>	 (PS2) Klausman: admin_ng: add revertrisk model config to ml-serve clusters [deployment-charts] - https://gerrit.wikimedia.org/r/920208
[09:11:43] <icinga-wm>	 PROBLEM - Check systemd state on es1020 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-mysqld-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:11:45] <wikibugs>	 (PS3) Klausman: admin_ng: add revertrisk model config to ml-serve clusters [deployment-charts] - https://gerrit.wikimedia.org/r/920208 (https://phabricator.wikimedia.org/T333124)
[09:16:27] <wikibugs>	 (PS1) Fabfur: admin: Add fabfur user [puppet] - https://gerrit.wikimedia.org/r/920209
[09:16:29] <wikibugs>	 (CR) Welcome, new contributor!: "Thank you for making your first contribution to Wikimedia! :) To learn how to get your code changes reviewed faster and more likely to get" [puppet] - https://gerrit.wikimedia.org/r/920209 (owner: Fabfur)
[09:20:16] <jnuche>	 jouncebot: nowandnext
[09:20:16] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 39 minute(s)
[09:20:17] <jouncebot>	 In 0 hour(s) and 39 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230516T1000)
[09:21:09] <marostegui>	 !log Optimize s5 on dbstore1003 T336733
[09:21:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:21:14] <stashbot>	 T336733: dbstore1003 filling up - https://phabricator.wikimedia.org/T336733
[09:21:22] <wikibugs>	 (PS2) Vgutierrez: cache::haproxy: Set nbthreads on the first global section [puppet] - https://gerrit.wikimedia.org/r/920207 (https://phabricator.wikimedia.org/T317799)
[09:22:23] <wikibugs>	 (PS1) Effie Mouzeli: php-multiversion-base: update readme [docker-images/production-images] - https://gerrit.wikimedia.org/r/920210 (https://phabricator.wikimedia.org/T336025)
[09:23:02] <logmsgbot>	 !log jnuche@deploy1002 Installing scap version "4.52.2" for 595 hosts
[09:23:14] <wikibugs>	 (PS2) Effie Mouzeli: php-multiversion-base: update readme [docker-images/production-images] - https://gerrit.wikimedia.org/r/920210 (https://phabricator.wikimedia.org/T336025)
[09:23:27] <logmsgbot>	 !log jelto@cumin1001 END (FAIL) - Cookbook sre.gitlab.reboot-runner (exit_code=1) rolling reboot on A:gitlab-runner
[09:23:49] <wikibugs>	 (PS3) Effie Mouzeli: php-multiversion-base: update readme [docker-images/production-images] - https://gerrit.wikimedia.org/r/920210 (https://phabricator.wikimedia.org/T336025)
[09:24:53] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +1] php-multiversion-base: update readme [docker-images/production-images] - https://gerrit.wikimedia.org/r/920210 (https://phabricator.wikimedia.org/T336025) (owner: Effie Mouzeli)
[09:25:20] <logmsgbot>	 !log jelto@cumin1001 START - Cookbook sre.hosts.reboot-single for host gitlab-runner1004.eqiad.wmnet
[09:25:37] <icinga-wm>	 RECOVERY - Check systemd state on es1020 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:26:00] <wikibugs>	 (CR) Effie Mouzeli: [V: +2 C: +2] php-multiversion-base: update readme [docker-images/production-images] - https://gerrit.wikimedia.org/r/920210 (https://phabricator.wikimedia.org/T336025) (owner: Effie Mouzeli)
[09:26:08] <wikibugs>	 SRE, DBA, Data-Platform-SRE, Infrastructure-Foundations, and 10 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (hnowlan)
[09:26:28] <wikibugs>	 SRE, Infrastructure-Foundations: DHCP error while trying to run the reimaging cookbook for dns2005.wikimedia.org (install server install2004.wikimedia.org) - https://phabricator.wikimedia.org/T336696 (Volans) Open→Resolved a:Volans File removed `sudo rm mgmt-codfw/ssw1-a1-codfw.mgmt.codfw.wmn...
[09:28:29] <wikibugs>	 (PS1) Filippo Giunchedi: perf: disable promql/series lint checks for navtiming [alerts] - https://gerrit.wikimedia.org/r/920211 (https://phabricator.wikimedia.org/T309182)
[09:28:56] <wikibugs>	 (PS3) Vgutierrez: cache::haproxy: Set nbthreads on the first global section [puppet] - https://gerrit.wikimedia.org/r/920207 (https://phabricator.wikimedia.org/T317799)
[09:30:16] <wikibugs>	 (CR) Vgutierrez: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41198/console" [puppet] - https://gerrit.wikimedia.org/r/920207 (https://phabricator.wikimedia.org/T317799) (owner: Vgutierrez)
[09:31:12] <logmsgbot>	 !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gitlab-runner1004.eqiad.wmnet
[09:32:34] <wikibugs>	 (CR) Klausman: [C: +1] ml-services: add RevertRisk Wikidata model to staging [deployment-charts] - https://gerrit.wikimedia.org/r/919364 (https://phabricator.wikimedia.org/T333125) (owner: AikoChou)
[09:33:29] <wikibugs>	 (CR) Vgutierrez: [V: +1 C: +2] cache::haproxy: Set nbthreads on the first global section [puppet] - https://gerrit.wikimedia.org/r/920207 (https://phabricator.wikimedia.org/T317799) (owner: Vgutierrez)
[09:36:16] <wikibugs>	 (PS2) Fabfur: admin: Add fabfur user - more readable [puppet] - https://gerrit.wikimedia.org/r/920209
[09:36:51] <wikibugs>	 SRE, Infrastructure-Foundations: DHCP error while trying to run the reimaging cookbook for dns2005.wikimedia.org (install server install2004.wikimedia.org) - https://phabricator.wikimedia.org/T336696 (Volans) FYI the workflow is described at https://wikitech.wikimedia.org/wiki/Server_Lifecycle/Reimage#Wo...
[09:37:18] <icinga-wm>	 RECOVERY - Host gitlab-runner1003 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms
[09:38:16] <logmsgbot>	 !log jelto@cumin1001 START - Cookbook sre.hosts.reboot-single for host gitlab-runner1003.eqiad.wmnet
[09:40:30] <wikibugs>	 (PS1) Vgutierrez: cache::haproxy: Fix missing socket variable [puppet] - https://gerrit.wikimedia.org/r/920212 (https://phabricator.wikimedia.org/T317799)
[09:41:59] <wikibugs>	 (CR) Vgutierrez: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41199/console" [puppet] - https://gerrit.wikimedia.org/r/920212 (https://phabricator.wikimedia.org/T317799) (owner: Vgutierrez)
[09:43:29] <wikibugs>	 (CR) Vgutierrez: [V: +1 C: +2] cache::haproxy: Fix missing socket variable [puppet] - https://gerrit.wikimedia.org/r/920212 (https://phabricator.wikimedia.org/T317799) (owner: Vgutierrez)
[09:43:37] <wikibugs>	 (PS3) Fabfur: admin: Add fabfur user [puppet] - https://gerrit.wikimedia.org/r/920209
[09:44:27] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] dcops: temp disable promql/series pint check for InterfaceSpeedError [alerts] - https://gerrit.wikimedia.org/r/920205 (https://phabricator.wikimedia.org/T309182) (owner: Filippo Giunchedi)
[09:44:33] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] perf: disable promql/series lint checks for navtiming [alerts] - https://gerrit.wikimedia.org/r/920211 (https://phabricator.wikimedia.org/T309182) (owner: Filippo Giunchedi)
[09:44:55] <wikibugs>	 (CR) Elukey: "Looks good to me (left a nit for the commit msg)! I don't see the revert risk namespace and configs in ml-staging-codfw, so you'll probabl" [deployment-charts] - https://gerrit.wikimedia.org/r/920208 (https://phabricator.wikimedia.org/T333124) (owner: Klausman)
[09:44:59] <wikibugs>	 ops-eqiad, serviceops-collab, GitLab (Infrastructure): gitlab-runner1003 is not coming back online - https://phabricator.wikimedia.org/T336737 (Jelto)
[09:45:30] <wikibugs>	 (PS10) Giuseppe Lavagetto: Start using the ClusterConfig class [mediawiki-config] - https://gerrit.wikimedia.org/r/756016
[09:45:32] <wikibugs>	 (PS8) Giuseppe Lavagetto: Simplify management of the request time limit [mediawiki-config] - https://gerrit.wikimedia.org/r/749718
[09:45:34] <wikibugs>	 (PS1) Giuseppe Lavagetto: Do not use firejail on kubernetes [mediawiki-config] - https://gerrit.wikimedia.org/r/920213
[09:45:44] <wikibugs>	 (PS4) Klausman: helmfile.d: add revertrisk model config to ml-serve clusters [deployment-charts] - https://gerrit.wikimedia.org/r/920208 (https://phabricator.wikimedia.org/T333124)
[09:45:55] <wikibugs>	 (CR) Elukey: [C: +1] helmfile.d: add revertrisk model config to ml-serve clusters [deployment-charts] - https://gerrit.wikimedia.org/r/920208 (https://phabricator.wikimedia.org/T333124) (owner: Klausman)
[09:46:15] <wikibugs>	 (CR) Klausman: helmfile.d: add revertrisk model config to ml-serve clusters (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/920208 (https://phabricator.wikimedia.org/T333124) (owner: Klausman)
[09:46:27] <wikibugs>	 (CR) Ilias Sarantopoulos: [C: +1] "LGTM!" [deployment-charts] - https://gerrit.wikimedia.org/r/920208 (https://phabricator.wikimedia.org/T333124) (owner: Klausman)
[09:46:57] <wikibugs>	 (CR) Vgutierrez: [C: +1] admin: Add fabfur user [puppet] - https://gerrit.wikimedia.org/r/920209 (owner: Fabfur)
[09:49:13] <logmsgbot>	 !log btullis@deploy1002 Started deploy [airflow-dags/analytics_product@7642b62]: (no justification provided)
[09:49:22] <logmsgbot>	 !log btullis@deploy1002 Finished deploy [airflow-dags/analytics_product@7642b62]: (no justification provided) (duration: 00m 09s)
[09:49:27] <wikibugs>	 (CR) Elukey: [C: +1] "LGTM! I added Ben to the code review so Data Engineering can comment as well :)" [puppet] - https://gerrit.wikimedia.org/r/919802 (owner: Majavah)
[09:50:31] <wikibugs>	 (CR) AikoChou: [C: +1] helmfile.d: add revertrisk model config to ml-serve clusters [deployment-charts] - https://gerrit.wikimedia.org/r/920208 (https://phabricator.wikimedia.org/T333124) (owner: Klausman)
[09:51:06] <wikibugs>	 (CR) Vgutierrez: [V: +1 C: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41200/console" [puppet] - https://gerrit.wikimedia.org/r/920209 (owner: Fabfur)
[09:51:14] <wikibugs>	 (CR) Elukey: [C: +1] "Thanks!" [cookbooks] - https://gerrit.wikimedia.org/r/920192 (https://phabricator.wikimedia.org/T336491) (owner: Slyngshede)
[09:51:20] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/920209 (owner: Fabfur)
[09:52:32] <wikibugs>	 (CR) Elukey: [C: +1] ml-services: add RevertRisk Wikidata model to staging [deployment-charts] - https://gerrit.wikimedia.org/r/919364 (https://phabricator.wikimedia.org/T333125) (owner: AikoChou)
[09:52:40] <wikibugs>	 (CR) Elukey: [C: +2] ml-services: add RevertRisk Wikidata model to staging [deployment-charts] - https://gerrit.wikimedia.org/r/919364 (https://phabricator.wikimedia.org/T333125) (owner: AikoChou)
[09:53:22] <wikibugs>	 SRE, Infrastructure-Foundations, LDAP: Migrate the r/w LDAP servers to Bullseye - https://phabricator.wikimedia.org/T331699 (MoritzMuehlenhoff) Thanks for all the input, much appreciated! I'll revise the plan and update the task in the next days.
[09:55:31] <wikibugs>	 (CR) Vgutierrez: [V: +1 C: +2] admin: Add fabfur user [puppet] - https://gerrit.wikimedia.org/r/920209 (owner: Fabfur)
[09:56:31] <wikibugs>	 SRE, Observability-Metrics, Traffic, User-fgiunchedi: Upgrade cadvisor to 0.44 fleetwide - https://phabricator.wikimedia.org/T336740 (fgiunchedi)
[09:58:09] <wikibugs>	 (PS1) Ladsgroup: Prepare for v0.1.3 release [software/wmfdb] - https://gerrit.wikimedia.org/r/920214 (https://phabricator.wikimedia.org/T334455)
[09:58:36] <icinga-wm>	 PROBLEM - Host gitlab-runner1003 is DOWN: PING CRITICAL - Packet loss = 100%
[09:58:49] <wikibugs>	 (CR) Muehlenhoff: sre.ganeti.makevm call reimage after VM creation (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/920203 (https://phabricator.wikimedia.org/T336491) (owner: Slyngshede)
[09:59:04] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Temporary drop krb1001 from KDC list used by clients [puppet] - https://gerrit.wikimedia.org/r/920204 (https://phabricator.wikimedia.org/T331695) (owner: Muehlenhoff)
[10:00:05] <jouncebot>	 Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230516T1000)
[10:03:07] <logmsgbot>	 !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[10:03:10] <logmsgbot>	 !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[10:03:16] <wikibugs>	 SRE, Observability-Metrics, serviceops, User-fgiunchedi: Upgrade cadvisor to 0.44 fleetwide - https://phabricator.wikimedia.org/T336740 (fgiunchedi)
[10:04:18] <wikibugs>	 (CR) Volans: [C: -1] "There's a small issue with a condition" [cookbooks] - https://gerrit.wikimedia.org/r/920203 (https://phabricator.wikimedia.org/T336491) (owner: Slyngshede)
[10:06:09] <wikibugs>	 SRE, Data-Engineering, Data-Platform-SRE, LDAP-Access-Requests, and 3 others: Grant temporary access to web based Data Engineering tools to Bishop Fox - https://phabricator.wikimedia.org/T336357 (BTullis)
[10:06:28] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' .
[10:07:02] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' .
[10:11:52] <wikibugs>	 (PS1) Elukey: conftool-data: add discovery config for the k8s-ingress-mlserve [puppet] - https://gerrit.wikimedia.org/r/920215
[10:12:08] <wikibugs>	 (CR) Cathal Mooney: [C: +1] "LGTM thanks." [alerts] - https://gerrit.wikimedia.org/r/920205 (https://phabricator.wikimedia.org/T309182) (owner: Filippo Giunchedi)
[10:12:47] <wikibugs>	 (PS2) Elukey: conftool-data: add discovery config for the k8s-ingress-ml-serve [puppet] - https://gerrit.wikimedia.org/r/920215
[10:13:44] <Amir1>	 !log cleaning up echo notification table in all wikis (T318523)
[10:13:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:13:48] <stashbot>	 T318523: Don't send article-linked notifications for bots - https://phabricator.wikimedia.org/T318523
[10:23:01] <wikibugs>	 (PS1) Effie Mouzeli: Revert "php-multiversion-base: add rsvg-convert" [docker-images/production-images] - https://gerrit.wikimedia.org/r/920228
[10:23:25] <wikibugs>	 (PS1) Elukey: Add VIP records for the new k8s-ingress-ml-serve endpoint [dns] - https://gerrit.wikimedia.org/r/920216 (https://phabricator.wikimedia.org/T336726)
[10:26:14] <wikibugs>	 (CR) JMeybohm: Update charts from mesh.configuration 1.1 to 1.2 (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/919849 (https://phabricator.wikimedia.org/T300324) (owner: JMeybohm)
[10:26:37] <wikibugs>	 SRE, Domains: Mark Monitor administration panel - https://phabricator.wikimedia.org/T333827 (Jacek_Broda_WMPL) a:Jacek_Broda_WMPL→None
[10:27:14] <wikibugs>	 (PS1) Vgutierrez: hiera: Use HAProxy 2.7.x on cp5032 [puppet] - https://gerrit.wikimedia.org/r/920217 (https://phabricator.wikimedia.org/T317799)
[10:28:43] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.dns.netbox
[10:29:32] <wikibugs>	 (PS2) Vgutierrez: hiera: Use HAProxy 2.7.x on cp5032 [puppet] - https://gerrit.wikimedia.org/r/920217 (https://phabricator.wikimedia.org/T317799)
[10:29:46] <logmsgbot>	 !log akosiaris@cumin1001 START - Cookbook sre.discovery.datacenter depool all active/active services in codfw: codfw row D switches upgrade - T335042
[10:29:50] <stashbot>	 T335042: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042
[10:30:05] <wikibugs>	 SRE, DBA, Data-Platform-SRE, Infrastructure-Foundations, and 10 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (ops-monitoring-bot) akosiaris@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter depool all active/active services in codfw: codfw row D switches...
[10:30:17] <wikibugs>	 (CR) Vgutierrez: [C: +2] hiera: Use HAProxy 2.7.x on cp5032 [puppet] - https://gerrit.wikimedia.org/r/920217 (https://phabricator.wikimedia.org/T317799) (owner: Vgutierrez)
[10:32:43] <logmsgbot>	 !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[10:32:46] <logmsgbot>	 !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[10:32:51] <wikibugs>	 (CR) JMeybohm: [C: +2] Update charts from mesh.configuration 1.2.0 to 1.2.1 [deployment-charts] - https://gerrit.wikimedia.org/r/919848 (https://phabricator.wikimedia.org/T300324) (owner: JMeybohm)
[10:33:14] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add new VIP records for k8s-ingress-ml-serve - elukey@cumin1001"
[10:33:45] <vgutierrez>	 !log testing HAProxy 2.7.8 in cp4052 and cp5032 (upload) - T317799
[10:33:48] <vgutierrez>	 ^^ cdanis 
[10:33:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:33:49] <stashbot>	 T317799: Rate limiting for hotlinked images - https://phabricator.wikimedia.org/T317799
[10:33:56] <wikibugs>	 (Merged) jenkins-bot: Update charts from mesh.configuration 1.2.0 to 1.2.1 [deployment-charts] - https://gerrit.wikimedia.org/r/919848 (https://phabricator.wikimedia.org/T300324) (owner: JMeybohm)
[10:34:19] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add new VIP records for k8s-ingress-ml-serve - elukey@cumin1001"
[10:34:19] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[10:34:52] <logmsgbot>	 !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on mc-wf[2001-2002].codfw.wmnet,mc-wf[1001-1002].eqiad.wmnet with reason: kernel upgrade
[10:35:07] <logmsgbot>	 !log akosiaris@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 0:30:00 on mc-wf[2001-2002].codfw.wmnet,mc-wf[1001-1002].eqiad.wmnet with reason: kernel upgrade
[10:35:09] <wikibugs>	 (PS1) Elukey: service::catalog: add initial config for k8s-ingress-ml-serve [puppet] - https://gerrit.wikimedia.org/r/920218 (https://phabricator.wikimedia.org/T336726)
[10:35:24] <wikibugs>	 (CR) Hnowlan: [C: +2] admin_ng, thumbor: double memory limit for namespace and pods [deployment-charts] - https://gerrit.wikimedia.org/r/919808 (https://phabricator.wikimedia.org/T334488) (owner: Hnowlan)
[10:36:38] <logmsgbot>	 !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[10:36:41] <logmsgbot>	 !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[10:37:55] <wikibugs>	 (Merged) jenkins-bot: admin_ng, thumbor: double memory limit for namespace and pods [deployment-charts] - https://gerrit.wikimedia.org/r/919808 (https://phabricator.wikimedia.org/T334488) (owner: Hnowlan)
[10:38:37] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/cxserver: apply
[10:39:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:39:05] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: apply
[10:39:16] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/machinetranslation: apply
[10:40:46] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply
[10:42:44] <wikibugs>	 (CR) Vgutierrez: trafficserver: allow partial traffic flow to mw on k8s (1 comment) [puppet] - https://gerrit.wikimedia.org/r/917841 (https://phabricator.wikimedia.org/T336038) (owner: Giuseppe Lavagetto)
[10:43:11] <logmsgbot>	 !log jelto@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host gitlab-runner1003.eqiad.wmnet
[10:43:21] <wikibugs>	 (CR) Vgutierrez: [C: +1] trafficserver: make mw-on-k8s use a config file [puppet] - https://gerrit.wikimedia.org/r/917840 (https://phabricator.wikimedia.org/T336037) (owner: Giuseppe Lavagetto)
[10:44:03] <jinxer-wm>	 (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:46:20] <wikibugs>	 (PS1) Elukey: service::catalog: switch k8s-ingress-ml-serve to lvs_setup [puppet] - https://gerrit.wikimedia.org/r/920219 (https://phabricator.wikimedia.org/T336726)
[10:46:22] <wikibugs>	 (PS1) Elukey: service::catalog: switch k8s-ingress-ml-serve to production [puppet] - https://gerrit.wikimedia.org/r/920220 (https://phabricator.wikimedia.org/T336726)
[10:46:43] <wikibugs>	 SRE, DBA, Data-Platform-SRE, Infrastructure-Foundations, and 10 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (ops-monitoring-bot) akosiaris@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter depool all active/active services in codfw: codfw row D switches...
[10:48:13] <logmsgbot>	 !log akosiaris@cumin1001 END (FAIL) - Cookbook sre.discovery.datacenter (exit_code=93) depool all active/active services in codfw: codfw row D switches upgrade - T335042
[10:48:19] <stashbot>	 T335042: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042
[10:48:27] <logmsgbot>	 !log akosiaris@cumin1001 START - Cookbook sre.discovery.datacenter status all services in all: None - None
[10:48:30] <logmsgbot>	 !log akosiaris@cumin1001 END (PASS) - Cookbook sre.discovery.datacenter (exit_code=0) status all services in all: None - None
[10:48:44] <logmsgbot>	 !log akosiaris@cumin1001 START - Cookbook sre.discovery.datacenter status all services in all: None - None
[10:48:46] <logmsgbot>	 !log akosiaris@cumin1001 END (PASS) - Cookbook sre.discovery.datacenter (exit_code=0) status all services in all: None - None
[10:49:43] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[10:50:01] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[10:50:03] <logmsgbot>	 !log akosiaris@cumin1001 START - Cookbook sre.hosts.reboot-single for host rdb1010.eqiad.wmnet
[10:51:16] <logmsgbot>	 !log akosiaris@cumin1001 START - Cookbook sre.hosts.reboot-single for host rdb1012.eqiad.wmnet
[10:51:30] <logmsgbot>	 !log akosiaris@cumin1001 START - Cookbook sre.hosts.reboot-single for host rdb2008.codfw.wmnet
[10:51:38] <logmsgbot>	 !log akosiaris@cumin1001 START - Cookbook sre.hosts.reboot-single for host rdb2010.codfw.wmnet
[10:52:04] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'.
[10:52:53] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[10:54:33] <wikibugs>	 (PS1) Elukey: Add discovery configuration for k8s-ingress-ml-serve [dns] - https://gerrit.wikimedia.org/r/920221 (https://phabricator.wikimedia.org/T336726)
[10:54:35] <wikibugs>	 (PS1) Elukey: Add ores-legacy.discovery.wment configuration [dns] - https://gerrit.wikimedia.org/r/920222 (https://phabricator.wikimedia.org/T336726)
[10:55:22] <wikibugs>	 (CR) CI reject: [V: -1] Add discovery configuration for k8s-ingress-ml-serve [dns] - https://gerrit.wikimedia.org/r/920221 (https://phabricator.wikimedia.org/T336726) (owner: Elukey)
[10:55:27] <wikibugs>	 (CR) CI reject: [V: -1] Add ores-legacy.discovery.wment configuration [dns] - https://gerrit.wikimedia.org/r/920222 (https://phabricator.wikimedia.org/T336726) (owner: Elukey)
[10:56:03] <wikibugs>	 (PS2) Elukey: Add ores-legacy.discovery.wment configuration [dns] - https://gerrit.wikimedia.org/r/920222 (https://phabricator.wikimedia.org/T336726)
[10:56:53] <wikibugs>	 (CR) CI reject: [V: -1] Add ores-legacy.discovery.wment configuration [dns] - https://gerrit.wikimedia.org/r/920222 (https://phabricator.wikimedia.org/T336726) (owner: Elukey)
[10:58:13] <logmsgbot>	 !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rdb1010.eqiad.wmnet
[10:58:36] <logmsgbot>	 !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rdb2008.codfw.wmnet
[10:59:09] <logmsgbot>	 !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rdb1012.eqiad.wmnet
[11:00:03] <moritzm>	 !log updated bookworm image to RC3 T330495
[11:00:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:00:17] <stashbot>	 T330495: Prepare our custom installer for Bookworm - https://phabricator.wikimedia.org/T330495
[11:00:17] <logmsgbot>	 !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rdb2010.codfw.wmnet
[11:01:28] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.reimage for host testvm2004.codfw.wmnet with OS bookworm
[11:02:48] <wikibugs>	 (PS2) Slyngshede: signup:blocklist Expand blocklist feature [software/bitu] - https://gerrit.wikimedia.org/r/919005
[11:03:33] <wikibugs>	 (PS1) Volans: dhcp: cleanup the snippet on refresh failure [software/spicerack] - https://gerrit.wikimedia.org/r/920224 (https://phabricator.wikimedia.org/T336696)
[11:03:35] <wikibugs>	 (PS1) Volans: dhcp: reword some exception messages [software/spicerack] - https://gerrit.wikimedia.org/r/920225
[11:03:37] <wikibugs>	 (CR) Slyngshede: signup:blocklist Expand blocklist feature (1 comment) [software/bitu] - https://gerrit.wikimedia.org/r/919005 (owner: Slyngshede)
[11:04:45] <wikibugs>	 SRE, DBA, Data-Platform-SRE, Infrastructure-Foundations, and 10 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (BTullis)
[11:05:59] <mvolz>	 Does anyone mind if I use this empty window to deploy? 
[11:07:50] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs2009 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.43:443]) https://wikitech.wikimedia.org/wiki/PyBal
[11:08:14] <_joe_>	 uh
[11:08:49] <_joe_>	 that's schema
[11:08:59] <mvolz>	 (it would be zotero)
[11:09:03] <_joe_>	 is someone doing something with lvs2009?
[11:09:18] <mvolz>	 not me :)
[11:09:34] <_joe_>	 mvolz: sorry I am looking at the alert right now
[11:09:39] <mvolz>	 np
[11:09:51] <_joe_>	 vgutierrez: topranks: ^^
[11:10:10] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs2010 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.43:443]) https://wikitech.wikimedia.org/wiki/PyBal
[11:10:59] * topranks here 
[11:11:15] * vgutierrez already checking
[11:11:39] <_joe_>	 so it looks like the problem is every backend is depooled
[11:12:12] <vgutierrez>	 again? :)
[11:12:15] <_joe_>	 $ curl localhost:9090/pools/schema_443
[11:12:17] <_joe_>	 schema2004.codfw.wmnet:	disabled/up/not pooled
[11:12:19] <_joe_>	 schema2003.codfw.wmnet:	disabled/up/not pooled
[11:12:39] <_joe_>	 I guess someone has done something with those servers?
[11:13:22] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs2009 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[11:13:37] <_joe_>	 I repooled 2003
[11:13:43] <vgutierrez>	 nothing on SAL AFAIK
[11:13:44] <_joe_>	 now we can check what caused it
[11:13:55] <_joe_>	 yeah we need to go look at the etcd logs I guess
[11:14:43] <vgutierrez>	 pybal noticed at 11:03:58
[11:15:17] <topranks>	 May 16 11:03:58 lvs2010 pybal[1983329]: [schema_443] INFO: Merged disabled server schema2004.codfw.wmnet, weight 10
[11:15:17] <topranks>	 May 16 11:03:58 lvs2010 pybal[1983329]: [schema_443] INFO: Merged disabled server schema2003.codfw.wmnet, weight 10
[11:15:20] <vgutierrez>	 yep
[11:15:42] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs2010 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[11:16:24] <logmsgbot>	 !log akosiaris@cumin1001 START - Cookbook sre.hosts.reboot-single for host rdb2007.codfw.wmnet
[11:17:53] <logmsgbot>	 !log jmm@cumin2002 END (ERROR) - Cookbook sre.ganeti.reimage (exit_code=97) for host testvm2004.codfw.wmnet with OS bookworm
[11:18:01] <_joe_>	 it was run from the servers
[11:18:04] <_joe_>	 found with
[11:18:07] <vgutierrez>	 and etcd as well
[11:18:10] <vgutierrez>	 May 16 11:03:58 conf2005 etcdmirror-conftool-eqiad-wmnet[5393]: [etcd-mirror] INFO: Replicating key /conftool/v1/pools/codfw/eventschemas/eventschemas/schema2004.codfw.wmnet at index 1952400
[11:18:11] <vgutierrez>	 May 16 11:13:08 conf2005 etcdmirror-conftool-eqiad-wmnet[5393]: [etcd-mirror] INFO: Replicating key /conftool/v1/pools/codfw/eventschemas/eventschemas/schema2003.codfw.wmnet at index 1952401
[11:18:15] <vgutierrez>	 _joe_: oh
[11:18:17] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.reimage for host testvm2002.codfw.wmnet with OS bookworm
[11:18:20] <_joe_>	 sudo cumin 'conf1*' 'fgrep schema2004 /var/log/nginx/etcd_access.log | grep -v GET'
[11:18:29] <vgutierrez>	 so who ran that?
[11:18:38] <_joe_>	 someone or something ran "depool"
[11:18:42] <_joe_>	 on each server
[11:18:53] <_joe_>	 vgutierrez: can you check the cumin logs? I'll check the servers
[11:19:02] <vgutierrez>	 btullis: ^^
[11:19:12] <vgutierrez>	 btullis logged in at 11:03
[11:19:19] <vgutierrez>	 on schema2004
[11:19:31] <akosiaris>	 probably prep for codfw row D maint ? 
[11:19:49] <btullis>	 Yes, I depooled schema2004. Is there an issue?
[11:20:07] <_joe_>	 btullis: schema2003 was also depooled
[11:20:28] <btullis>	 Oh, sorry. I hadn't seen that.
[11:20:29] <_joe_>	 so I repooled it at 11:13
[11:20:43] <btullis>	 Many thanks _joe_ 
[11:20:45] <_joe_>	 ok, mystery solved anyways, we were worried some cronjob caused this
[11:20:53] <akosiaris>	 !log reboot rdb2007 for kernel upgrades: possibly affected apps: netbox, changeprop, cpjobqueue, api-gateway, redisLockManager. Should be harmless however
[11:20:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:20:59] <topranks>	 yep good stuff 
[11:21:12] <_joe_>	 but schema was unavailable for 10 minutes in codfw
[11:21:49] <_joe_>	 vgutierrez: topranks did you get paged?
[11:21:53] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on 13 hosts with reason: maintenance
[11:21:53] <vgutierrez>	 nope
[11:22:02] <wikibugs>	 (CR) Klausman: [C: +1] service::catalog: add initial config for k8s-ingress-ml-serve [puppet] - https://gerrit.wikimedia.org/r/920218 (https://phabricator.wikimedia.org/T336726) (owner: Elukey)
[11:22:04] <_joe_>	 if not, we might want to add a paging probe on lvs for that service
[11:22:14] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on 13 hosts with reason: maintenance
[11:22:21] <btullis>	 Oh that's me. I forgot to repool schema2003 after row C upgrade. https://phabricator.wikimedia.org/T334049#8819429
[11:22:41] <wikibugs>	 (CR) Klausman: [C: +1] Add VIP records for the new k8s-ingress-ml-serve endpoint (1 comment) [dns] - https://gerrit.wikimedia.org/r/920216 (https://phabricator.wikimedia.org/T336726) (owner: Elukey)
[11:22:45] <topranks>	 _joe_: didn’t get paged
[11:23:04] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on 11 hosts with reason: maintenance
[11:23:09] <_joe_>	 btullis: I guess this is an actionable for you then
[11:23:23] <btullis>	 _joe_: Yes, I agree, that should page. I will add it.
[11:23:23] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on 11 hosts with reason: maintenance
[11:23:28] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on 14 hosts with reason: maintenance
[11:23:39] <_joe_>	 also - depool_threshold is clearly too low for schema
[11:23:46] <_joe_>	 pybal should protect against this
[11:23:49] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on 14 hosts with reason: maintenance
[11:24:04] <topranks>	 +1 this should probably go to victorops
[11:24:34] <_joe_>	 depool-threshold = .5 this is too low for a service with 2 servers 
[11:24:46] <logmsgbot>	 !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rdb2007.codfw.wmnet
[11:24:47] <_joe_>	 because it's computed before removing the server, not after
[11:25:14] <_joe_>	 I guess this is a two-line patch to service::catalog
[11:26:02] <btullis>	 Yes. Would `depool_threshold: ".6"` be OK to avoid this happening?
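A minimal sketch of the behaviour _joe_ describes above, not PyBal's actual code. It assumes the threshold check compares the currently pooled count against total * depool_threshold before the server is removed; the names can_depool and simulate are hypothetical and exist only for illustration.

```python
def can_depool(total_servers: int, pooled_servers: int, depool_threshold: float) -> bool:
    """Assumed check: evaluated BEFORE the server is removed from the pool."""
    return pooled_servers >= total_servers * depool_threshold


def simulate(total_servers: int, depool_threshold: float) -> int:
    """Keep depooling while the assumed check allows it; return servers left pooled."""
    pooled = total_servers
    while pooled > 0 and can_depool(total_servers, pooled, depool_threshold):
        pooled -= 1
    return pooled


# Two servers, threshold .5: the check passes at 2 pooled and again at 1 pooled.
print(simulate(2, 0.5))
# Two servers, threshold .6: the check fails once only 1 server is pooled.
print(simulate(2, 0.6))
```

Under that assumption, a two-server pool with .5 can drain completely, while .6 stops after the first depool and keeps one server pooled.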
[11:26:14] <jinxer-wm>	 (SystemdUnitFailed) firing: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-categories.service Failed on wdqs1010:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:29:49] <wikibugs>	 (PS1) Effie Mouzeli: Revert "php-multiversion-base: update readme" [docker-images/production-images] - https://gerrit.wikimedia.org/r/920229
[11:30:04] <wikibugs>	 (CR) Effie Mouzeli: [V: +2 C: +2] Revert "php-multiversion-base: update readme" [docker-images/production-images] - https://gerrit.wikimedia.org/r/920229 (owner: Effie Mouzeli)
[11:30:20] <wikibugs>	 (PS2) Effie Mouzeli: Revert "php-multiversion-base: add rsvg-convert" [docker-images/production-images] - https://gerrit.wikimedia.org/r/920228
[11:30:31] <wikibugs>	 (CR) Effie Mouzeli: [V: +2 C: +2] Revert "php-multiversion-base: add rsvg-convert" [docker-images/production-images] - https://gerrit.wikimedia.org/r/920228 (owner: Effie Mouzeli)
[11:30:42] <wikibugs>	 (PS1) Giuseppe Lavagetto: service::catalog: followup to schema incident [puppet] - https://gerrit.wikimedia.org/r/920248
[11:30:48] <_joe_>	 btullis: ^^
[11:30:58] <logmsgbot>	 !log jmm@cumin2002 END (ERROR) - Cookbook sre.ganeti.reimage (exit_code=97) for host testvm2002.codfw.wmnet with OS bookworm
[11:32:20] <btullis>	 _joe_: Great, many thanks.
[11:32:39] <wikibugs>	 (CR) Btullis: [C: +1] "Many thanks for this change." [puppet] - https://gerrit.wikimedia.org/r/920248 (owner: Giuseppe Lavagetto)
[11:34:35] <wikibugs>	 (PS1) KartikMistry: Updated MinT to 2023-05-16-112045-production [deployment-charts] - https://gerrit.wikimedia.org/r/920250 (https://phabricator.wikimedia.org/T336525)
[11:36:31] <_joe_>	 vgutierrez: fancy a pybal restart cycle?
[11:36:33] <_joe_>	 :P
[11:37:28] <wikibugs>	 (PS1) KartikMistry: Update cxserver to 2023-05-16-061239-production [deployment-charts] - https://gerrit.wikimedia.org/r/920251 (https://phabricator.wikimedia.org/T336657)
[11:38:24] <_joe_>	 mvolz: sorry, back to you - deploy whenever you want
[11:38:35] <mvolz>	 ty! 
[11:38:39] <_joe_>	 sorry for the delay but we had an ongoing outage 
[11:38:41] <mvolz>	 np
[11:38:47] <mvolz>	 it's not my window anyway :)
[11:39:04] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] service::catalog: followup to schema incident [puppet] - https://gerrit.wikimedia.org/r/920248 (owner: Giuseppe Lavagetto)
[11:39:16] <wikibugs>	 (CR) Mvolz: [C: +2] Update Zotero to most recent version [deployment-charts] - https://gerrit.wikimedia.org/r/920202 (https://phabricator.wikimedia.org/T336727) (owner: Mvolz)
[11:40:11] <wikibugs>	 (Merged) jenkins-bot: Update Zotero to most recent version [deployment-charts] - https://gerrit.wikimedia.org/r/920202 (https://phabricator.wikimedia.org/T336727) (owner: Mvolz)
[11:43:07] * kart_ updating MinT and cxserver
[11:43:34] <wikibugs>	 (CR) KartikMistry: [C: +2] Update cxserver to 2023-05-16-061239-production [deployment-charts] - https://gerrit.wikimedia.org/r/920251 (https://phabricator.wikimedia.org/T336657) (owner: KartikMistry)
[11:43:50] <logmsgbot>	 !log mvolz@deploy1002 helmfile [staging] START helmfile.d/services/zotero: apply
[11:44:34] <wikibugs>	 (Merged) jenkins-bot: Update cxserver to 2023-05-16-061239-production [deployment-charts] - https://gerrit.wikimedia.org/r/920251 (https://phabricator.wikimedia.org/T336657) (owner: KartikMistry)
[11:44:41] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good" [software/wmfdb] - https://gerrit.wikimedia.org/r/920214 (https://phabricator.wikimedia.org/T334455) (owner: Ladsgroup)
[11:44:54] <logmsgbot>	 !log mvolz@deploy1002 helmfile [staging] DONE helmfile.d/services/zotero: apply
[11:45:36] <logmsgbot>	 !log joal@deploy1002 Started deploy [analytics/refinery@2a0b1f2]: Regular analytics weekly train [analytics/refinery@2a0b1f2]
[11:46:01] <logmsgbot>	 !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/cxserver: apply
[11:46:20] <logmsgbot>	 !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: apply
[11:47:22] <logmsgbot>	 !log oblivian@cumin1001 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on A:lvs-codfw
[11:47:40] <_joe_>	 jouncebot: now
[11:47:40] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 12 minute(s)
[11:47:47] <wikibugs>	 (CR) Hashar: [C: +1] "I'd love for Puppet to manage that list for us. modules/profile/manifests/ssh/client.pp as some magic Puppet DB query." [puppet] - https://gerrit.wikimedia.org/r/919405 (https://phabricator.wikimedia.org/T326368) (owner: Dzahn)
[11:49:19] <kart_>	 akosiaris: There are some unapplied changes in cxserver - is that safe to deploy?
[11:49:29] <kart_>	 Probably also on MinT.
[11:49:36] <logmsgbot>	 !log mvolz@deploy1002 helmfile [codfw] START helmfile.d/services/zotero: apply
[11:49:40] <_joe_>	 kart_: what changes? if it's envoy-related, it's ok
[11:50:04] <_joe_>	 let me check
[11:50:08] <logmsgbot>	 !log mvolz@deploy1002 helmfile [codfw] DONE helmfile.d/services/zotero: apply
[11:50:15] <kart_>	 _joe_: looks like that only.
[11:50:28] <kart_>	 but, can you please check?
[11:50:33] <wikibugs>	 (CR) Jaime Nuche: doc: temporary config for docs publishing from gitlab (1 comment) [puppet] - https://gerrit.wikimedia.org/r/919834 (https://phabricator.wikimedia.org/T336168) (owner: Jaime Nuche)
[11:50:45] <marostegui>	 !log install 10.4.29 on db1151 T336462
[11:50:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:50:49] <stashbot>	 T336462: Compile and package MariaDB 10.4.29 - https://phabricator.wikimedia.org/T336462
[11:50:50] <_joe_>	 jayme: I think it's your changes in configuration to envoy
[11:51:10] <logmsgbot>	 !log oblivian@cumin1001 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on A:lvs-codfw
[11:51:26] <_joe_>	 kart_: it should be ok to apply, go on
[11:52:08] <kart_>	 T300324. Yes. Thanks.
[11:52:09] <stashbot>	 T300324: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324
[11:52:24] <_joe_>	 I'll restart the eqiad pybals after lunch
[11:52:28] <wikibugs>	 (PS1) Effie Mouzeli: php-multiversion-base: add librsvg2-bin [docker-images/production-images] - https://gerrit.wikimedia.org/r/920257 (https://phabricator.wikimedia.org/T336025)
[11:52:41] <logmsgbot>	 !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/cxserver: apply
[11:53:16] <logmsgbot>	 !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply
[11:55:16] <logmsgbot>	 !log mvolz@deploy1002 helmfile [eqiad] START helmfile.d/services/zotero: apply
[11:55:28] <logmsgbot>	 !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/cxserver: apply
[11:55:50] <logmsgbot>	 !log mvolz@deploy1002 helmfile [eqiad] DONE helmfile.d/services/zotero: apply
[11:56:20] <logmsgbot>	 !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply
[11:56:21] <logmsgbot>	 !log joal@deploy1002 Finished deploy [analytics/refinery@2a0b1f2]: Regular analytics weekly train [analytics/refinery@2a0b1f2] (duration: 10m 45s)
[11:57:04] <XioNoX>	 !log stage upgrade on asw-d-codfw - T335042
[11:57:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:57:08] <stashbot>	 T335042: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042
[11:58:16] <wikibugs>	 (PS1) Majavah: Add an option to disable NFS access [software/tools-webservice] - https://gerrit.wikimedia.org/r/920259 (https://phabricator.wikimedia.org/T334081)
[11:58:45] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good" [software/bitu-ldap] - https://gerrit.wikimedia.org/r/919273 (owner: Slyngshede)
[11:59:09] <kart_>	 !log Updated cxserver to 2023-05-16-061239-production (T336657)
[11:59:12] <wikibugs>	 (CR) Elukey: Add VIP records for the new k8s-ingress-ml-serve endpoint (1 comment) [dns] - https://gerrit.wikimedia.org/r/920216 (https://phabricator.wikimedia.org/T336726) (owner: Elukey)
[11:59:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:59:13] <stashbot>	 T336657: Enable MinT for Central Bikol in Content Translation - https://phabricator.wikimedia.org/T336657
[12:01:22] <wikibugs>	 (CR) KartikMistry: [C: +2] Updated MinT to 2023-05-16-112045-production [deployment-charts] - https://gerrit.wikimedia.org/r/920250 (https://phabricator.wikimedia.org/T336525) (owner: KartikMistry)
[12:02:02] <wikibugs>	 (Merged) jenkins-bot: Updated MinT to 2023-05-16-112045-production [deployment-charts] - https://gerrit.wikimedia.org/r/920250 (https://phabricator.wikimedia.org/T336525) (owner: KartikMistry)
[12:02:51] <wikibugs>	 SRE-Access-Requests, Data-Engineering, Event-Platform Value Stream: Allow gmodena and tchin to merge changes to operation/deployment-charts repo - https://phabricator.wikimedia.org/T336755 (Ottomata)
[12:02:59] <logmsgbot>	 !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/machinetranslation: apply
[12:03:45] <wikibugs>	 SRE-Access-Requests, Data-Engineering, Event-Platform Value Stream: Allow gmodena and tchin to merge changes to operation/deployment-charts repo - https://phabricator.wikimedia.org/T336755 (Ottomata)
[12:04:40] <logmsgbot>	 !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply
[12:06:08] <logmsgbot>	 !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/machinetranslation: apply
[12:09:08] <logmsgbot>	 !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/machinetranslation: apply
[12:14:43] <wikibugs>	 (CR) Ssingh: [C: +2] hiera: temporarily remove dns2002 from authdns_servers [puppet] - https://gerrit.wikimedia.org/r/919847 (https://phabricator.wikimedia.org/T335042) (owner: Ssingh)
[12:15:01] <jinxer-wm>	 (NodeTextfileStale) firing: (2) Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[12:15:10] <logmsgbot>	 !log joal@deploy1002 Started deploy [analytics/refinery@2a0b1f2] (thin): Regular analytics weekly train THIN [analytics/refinery@2a0b1f2]
[12:15:21] <logmsgbot>	 !log joal@deploy1002 Finished deploy [analytics/refinery@2a0b1f2] (thin): Regular analytics weekly train THIN [analytics/refinery@2a0b1f2] (duration: 00m 10s)
[12:15:22] <logmsgbot>	 !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/machinetranslation: apply
[12:17:43] <wikibugs>	 SRE, DBA, Data-Platform-SRE, Infrastructure-Foundations, and 10 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (ssingh)
[12:18:15] <wikibugs>	 (PS5) Klausman: helmfile.d: add revertrisk model config to ml-serve clusters [deployment-charts] - https://gerrit.wikimedia.org/r/920208 (https://phabricator.wikimedia.org/T333124)
[12:19:04] <wikibugs>	 (PS1) Ssingh: depool codfw for row D switch upgrade [dns] - https://gerrit.wikimedia.org/r/920265 (https://phabricator.wikimedia.org/T335042)
[12:19:41] <wikibugs>	 SRE, Infrastructure-Foundations, netops: TLS certificates for network devices - https://phabricator.wikimedia.org/T334594 (ayounsi) > +1 extending the lifetime is just delaying the issue and increasing the possibility its forgotten or missed  Yes and no. It depends on how much we can automate it with...
[12:19:54] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:20:08] <sukhe>	 ^ expected
[12:20:19] <_joe_>	 jouncebot: now
[12:20:19] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 39 minute(s)
[12:20:39] <logmsgbot>	 !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/machinetranslation: apply
[12:21:18] <kart_>	 !Updated MinT to 2023-05-16-112045-production (T336525, T336649, T336483, T336349)
[12:21:18] <stashbot>	 T336649: English not listed as target langauge in the UI - https://phabricator.wikimedia.org/T336649
[12:21:19] <stashbot>	 T336525: Review code mappings for MinT - https://phabricator.wikimedia.org/T336525
[12:21:19] <stashbot>	 T336349: Replace MinT dropdowns with ULS - https://phabricator.wikimedia.org/T336349
[12:21:19] <stashbot>	 T336483: Long sequence of a repeated word appears only when using MinT but not NLLB-200 directly - https://phabricator.wikimedia.org/T336483
[12:21:27] <XioNoX>	 !log disable ping offload in codfw - T335042
[12:21:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:21:31] <stashbot>	 T335042: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042
[12:21:44] <icinga-wm>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:21:47] <_joe_>	 kart_: this is not a great time to deploy your code
[12:21:50] <icinga-wm>	 PROBLEM - BFD status on cr1-codfw is CRITICAL: Down: 3 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[12:22:18] <wikibugs>	 (CR) Ssingh: [C: +2] depool codfw for row D switch upgrade [dns] - https://gerrit.wikimedia.org/r/920265 (https://phabricator.wikimedia.org/T335042) (owner: Ssingh)
[12:22:24] <icinga-wm>	 PROBLEM - BFD status on cr2-codfw is CRITICAL: Down: 3 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[12:22:32] <icinga-wm>	 PROBLEM - Bird Internet Routing Daemon on durum2002 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[12:22:34] <icinga-wm>	 PROBLEM - Bird Internet Routing Daemon on dns2002 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[12:22:37] <sukhe>	 ^ expected
[12:22:42] <wikibugs>	 (CR) Ayounsi: [C: +1] depool codfw for row D switch upgrade [dns] - https://gerrit.wikimedia.org/r/920265 (https://phabricator.wikimedia.org/T335042) (owner: Ssingh)
[12:22:51] <sukhe>	 !log running authdns-update to disable codfw for switch upgrade: T335042
[12:22:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:24:31] <sukhe>	 !log [done] running authdns-update to disable codfw for switch upgrade: T335042
[12:24:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:27:24] <logmsgbot>	 !log joal@deploy1002 Started deploy [analytics/refinery@2a0b1f2] (thin): Regular analytics weekly train THIN [analytics/refinery@2a0b1f2]
[12:27:29] <logmsgbot>	 !log joal@deploy1002 Finished deploy [analytics/refinery@2a0b1f2] (thin): Regular analytics weekly train THIN [analytics/refinery@2a0b1f2] (duration: 00m 04s)
[12:28:07] <logmsgbot>	 !log joal@deploy1002 Started deploy [analytics/refinery@2a0b1f2] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@2a0b1f2]
[12:29:37] <logmsgbot>	 !log joal@deploy1002 Finished deploy [analytics/refinery@2a0b1f2] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@2a0b1f2] (duration: 01m 30s)
[12:31:54] <joal>	 btullis: I need you again :S We have not documented the solution to overcome the git issue we're having when deploying onto HDFS - can you tell me the trick again (I forgot :S)
[12:35:00] <godog>	 !log start cadvisor 0.44 upgrade to buster hosts - T336740
[12:35:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:35:04] <stashbot>	 T336740: Upgrade cadvisor to 0.44 fleetwide - https://phabricator.wikimedia.org/T336740
[12:36:12] <joal>	 woop wrong chan :S
[12:36:53] <wikibugs>	 (PS1) Gmodena: mediawiki-page-content-change-enrichment: enable HA [deployment-charts] - https://gerrit.wikimedia.org/r/920268 (https://phabricator.wikimedia.org/T336656)
[12:37:33] <kart_>	 _joe_: Did I miss anything?
[12:39:04] <logmsgbot>	 !log akosiaris@cumin1001 START - Cookbook sre.hosts.reboot-single for host rdb1009.eqiad.wmnet
[12:39:20] <akosiaris>	 !log reboot rdb1009 for kernel upgrades: possibly affected apps: netbox, changeprop, cpjobqueue, api-gateway, redisLockManager. Should be harmless however
[12:39:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:40:56] <_joe_>	 kart_: sorry, I thought you were *about* to deploy your stuff and there is a maintenance going on in one of the datacenters
[12:44:09] <kart_>	 _joe_: I was done with it :)
[12:44:37] <_joe_>	 kart_: yeah I realized in the meantime :)
[12:44:53] <_joe_>	 I was trying to save you from a possible deployment failure
[12:44:56] <icinga-wm>	 PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_drmrs01_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:45:16] <logmsgbot>	 !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rdb1009.eqiad.wmnet
[12:46:54] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on 189 hosts with reason: codfw row D upgrade
[12:47:06] <jinxer-wm>	 (CirrusSearchJobQueueBacklogTooBig) firing: CirrusSearch job topic eqiad.mediawiki.job.cirrusSearchLinksUpdate is heavily backlogged with 236.7k messages - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=eqiad%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueBacklogTooBig
[12:47:10] <icinga-wm>	 PROBLEM - Check unit status of netbox_ganeti_drmrs01_sync on netbox1002 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_drmrs01_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[12:48:35] <jayme>	 kart_: oh, sorry. I only deployed staging before lunch - did not anticipate someone deploying cxserver the next hour
[12:48:56] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 189 hosts with reason: codfw row D upgrade
[12:49:11] <wikibugs>	 SRE, DBA, Data-Platform-SRE, Infrastructure-Foundations, and 9 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=3a841f97-aecd-4c7a-8eb4-8acd1caa15b3) set by ayounsi@cumin1001 for 2:00:00 on 189 host(...
[12:50:06] <moritzm>	 !log disabling Puppet in codfw/esams/ulsfo for switch maintenance T335042
[12:50:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:50:11] <stashbot>	 T335042: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042
[12:51:27] <Emperor>	 !log depool thanos-fe2003 T335042
[12:51:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:52:13] <Emperor>	 !log depool ms-fe2012 T335042
[12:52:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:53:17] <wikibugs>	 SRE, DBA, Data-Platform-SRE, Infrastructure-Foundations, and 9 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (MatthewVernon)
[12:54:09] <wikibugs>	 SRE, DBA, Data-Platform-SRE, Infrastructure-Foundations, and 9 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (MoritzMuehlenhoff)
[12:55:42] <icinga-wm>	 RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:57:42] <icinga-wm>	 RECOVERY - Check unit status of netbox_ganeti_drmrs01_sync on netbox1002 is OK: OK: Status of the systemd unit netbox_ganeti_drmrs01_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[13:00:04] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: May I have your attention please! UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230516T1300)
[13:00:04] <jouncebot>	 mazevedo: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:04] <jouncebot>	 Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230516T1300)
[13:00:17] <mazevedo>	 hi!
[13:00:56] <XioNoX>	 fyi, we're going to start a 20/30min maintenance, please hold any deployment if possible
[13:01:16] <XioNoX>	 !log asw-d-codfw> request system reboot all-members - T335042
[13:01:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:01:21] <stashbot>	 T335042: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042
[13:02:47] <taavi>	 I can deploy once XioNoX is done with the network maintenance
[13:03:05] <taavi>	 XioNoX: please add these to the deployment calendar the next time
[13:03:30] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - thumbor_8800: Servers kubernetes2010.codfw.wmnet, kubernetes2013.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[13:03:30] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s4 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 462.85 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[13:04:33] <mazevedo>	 ok
[13:04:36] <icinga-wm>	 PROBLEM - Host asw-d-codfw is DOWN: PING CRITICAL - Packet loss = 100%
[13:05:41] <XioNoX>	 taavi: last time I looked I didn't understand how it worked, and didn't go far enough in the future to add them
[13:06:03] <jinxer-wm>	 (ProbeDown) firing: (3) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4)   - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:06:10] <icinga-wm>	 PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 133, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[13:06:22] <icinga-wm>	 PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 135, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[13:06:30] <jinxer-wm>	 (Emergency syslog message) firing: Alert for device asw-d-codfw.mgmt.codfw.wmnet - Emergency syslog message   - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message
[13:06:30] <jinxer-wm>	 (virtual-chassis crash) firing: Alert for device asw-d-codfw.mgmt.codfw.wmnet - virtual-chassis crash   - https://alerts.wikimedia.org/?q=alertname%3Dvirtual-chassis+crash
[13:06:34] <icinga-wm>	 PROBLEM - VRRP status on cr1-codfw is CRITICAL: VRRP CRITICAL - 2 inconsistent interfaces, 0 misconfigured interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23VRRP_status
[13:06:44] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s4 on clouddb1019 is OK: OK slave_sql_lag Replication lag: 25.25 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[13:07:08] <icinga-wm>	 PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/suggest/sections/titles/{from}/{to} (Suggest target section titles for given source sections) timed out before a response was received: /v2/suggest/sections/{title}/{from}/{to} (Suggest source sections to translate) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX
[13:08:16] <jinxer-wm>	 (AppserversUnreachable) firing: Appserver unavailable for cluster appserver at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=codfw%20prometheus/ops&var-cluster=appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[13:08:16] <jinxer-wm>	 (AppserversUnreachable) firing: Appserver unavailable for cluster jobrunner at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=codfw%20prometheus/ops&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[13:08:21] <jinxer-wm>	 (AppserversUnreachable) firing: Appserver unavailable for cluster api_appserver at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=codfw%20prometheus/ops&var-cluster=api_appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[13:08:33] <jinxer-wm>	 (JobUnavailable) firing: (5) Reduced availability for job chartmuseum in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:08:49] <jinxer-wm>	 (WdqsStreamingUpdaterFlinkJobNotRunning) firing: WDQS_Streaming_Updater in codfw (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DWdqsStreamingUpdaterFlinkJobNotRunning
[13:08:49] <jinxer-wm>	 (WcqsStreamingUpdaterFlinkJobNotRunning) firing: WCQS_Streaming_Updater in codfw (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DWcqsStreamingUpdaterFlinkJobNotRunning
[13:08:53] <wikibugs>	 (PS1) Daniel Kinzler: Revert "Revert "Add getMultiHttpClient function to make HTTP requests to Mathoid."" [extensions/Math] (wmf/1.41.0-wmf.9) - https://gerrit.wikimedia.org/r/920230 (https://phabricator.wikimedia.org/T335347)
[13:09:14] <icinga-wm>	 PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:09:23] <wikibugs>	 (PS1) Daniel Kinzler: Use MultiHttpClient instead of VirtualRESTService. [extensions/Math] (wmf/1.41.0-wmf.9) - https://gerrit.wikimedia.org/r/920231 (https://phabricator.wikimedia.org/T335347)
[13:11:03] <jinxer-wm>	 (ProbeDown) firing: (3) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4)   - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:11:30] <jinxer-wm>	 (Emergency syslog message) resolved: Device asw-d-codfw.mgmt.codfw.wmnet recovered from Emergency syslog message   - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message
[13:12:00] <icinga-wm>	 PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/suggest/sections/titles/{from}/{to} (Suggest target section titles for given source sections) is CRITICAL: Test Suggest target section titles for given source sections returned the unexpected status 503 (expecting: 200): /v2/suggest/sections/{title}/{from}/{to} (Suggest source sections to translate) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX
[13:12:08] <icinga-wm>	 PROBLEM - configured eth on lvs2011 is CRITICAL: vlan2020 reporting no carrier. https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[13:12:52] <icinga-wm>	 RECOVERY - Host asw-d-codfw is UP: PING OK - Packet loss = 0%, RTA = 33.84 ms
[13:13:16] <jinxer-wm>	 (KubernetesRsyslogDown) firing: rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2001 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[13:13:32] <icinga-wm>	 PROBLEM - Check unit status of httpbb_hourly_appserver on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[13:13:32] <jinxer-wm>	 (JobUnavailable) firing: (6) Reduced availability for job chartmuseum in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:13:49] <jinxer-wm>	 (RdfStreamingUpdaterFlinkJobUnstable) firing: (2) WCQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[13:13:49] <jinxer-wm>	 (WdqsStreamingUpdaterFlinkJobNotRunning) resolved: WDQS_Streaming_Updater in codfw (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DWdqsStreamingUpdaterFlinkJobNotRunning
[13:13:54] <jinxer-wm>	 (WcqsStreamingUpdaterFlinkJobNotRunning) resolved: WCQS_Streaming_Updater in codfw (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DWcqsStreamingUpdaterFlinkJobNotRunning
[13:14:16] <icinga-wm>	 RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 141, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[13:14:17] <jinxer-wm>	 (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster main-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-kafka_cluster=main-codfw - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[13:14:30] <icinga-wm>	 RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 143, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[13:14:38] <icinga-wm>	 RECOVERY - VRRP status on cr1-codfw is OK: VRRP OK - 0 misconfigured interfaces, 0 inconsistent interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23VRRP_status
[13:15:08] <icinga-wm>	 RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX
[13:16:03] <jinxer-wm>	 (ProbeDown) resolved: (3) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4)   - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:16:49] <jinxer-wm>	 (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) firing: Processing latency of WCQS_Streaming_Updater in codfw (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh
[13:17:36] <wikibugs>	 (PS2) Majavah: Add stream config for mobile apps schema [mediawiki-config] - https://gerrit.wikimedia.org/r/919372 (https://phabricator.wikimedia.org/T336508) (owner: Mazevedo)
[13:17:43] <wikibugs>	 (PS3) Majavah: Add stream config for mobile apps schema [mediawiki-config] - https://gerrit.wikimedia.org/r/919372 (https://phabricator.wikimedia.org/T336508) (owner: Mazevedo)
[13:18:16] <jinxer-wm>	 (AppserversUnreachable) resolved: Appserver unavailable for cluster api_appserver at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=codfw%20prometheus/ops&var-cluster=api_appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[13:18:16] <jinxer-wm>	 (AppserversUnreachable) resolved: Appserver unavailable for cluster jobrunner at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=codfw%20prometheus/ops&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[13:18:21] <jinxer-wm>	 (AppserversUnreachable) resolved: Appserver unavailable for cluster appserver at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=codfw%20prometheus/ops&var-cluster=appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[13:18:32] <jinxer-wm>	 (JobUnavailable) resolved: (6) Reduced availability for job chartmuseum in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:19:17] <jinxer-wm>	 (KafkaUnderReplicatedPartitions) resolved: Under replicated partitions for Kafka cluster main-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-kafka_cluster=main-codfw - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[13:20:10] <taavi>	 XioNoX: can we go ahead with the deployment window or are things still recovering?
[13:21:30] <jinxer-wm>	 (virtual-chassis crash) resolved: Device asw-d-codfw.mgmt.codfw.wmnet recovered from virtual-chassis crash   - https://alerts.wikimedia.org/?q=alertname%3Dvirtual-chassis+crash
[13:21:35] <XioNoX>	 taavi: yep, everything is good now!
[13:21:40] <taavi>	 thanks!
[13:21:48] <XioNoX>	 thanks for waiting
[13:21:49] <jinxer-wm>	 (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) resolved: Processing latency of WCQS_Streaming_Updater in codfw (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh
[13:21:54] <wikibugs>	 (PS1) Ssingh: Revert "hiera: temporarily remove dns2002 from authdns_servers" [puppet] - https://gerrit.wikimedia.org/r/920232
[13:22:09] <taavi>	 mazevedo: deploying your patch now
[13:22:13] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/919372 (https://phabricator.wikimedia.org/T336508) (owner: Mazevedo)
[13:22:23] <mazevedo>	 awesome! let me know when to test
[13:22:58] <wikibugs>	 (Merged) jenkins-bot: Add stream config for mobile apps schema [mediawiki-config] - https://gerrit.wikimedia.org/r/919372 (https://phabricator.wikimedia.org/T336508) (owner: Mazevedo)
[13:23:16] <jinxer-wm>	 (KubernetesRsyslogDown) resolved: rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2001 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[13:23:29] <logmsgbot>	 !log taavi@deploy1002 Started scap: Backport for [[gerrit:919372|Add stream config for mobile apps schema (T336508)]]
[13:23:34] <stashbot>	 T336508: Add MobileWikiAppiOSNavigationEvents to MEP - https://phabricator.wikimedia.org/T336508
[13:23:49] <jinxer-wm>	 (RdfStreamingUpdaterFlinkJobUnstable) resolved: (2) WCQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[13:24:07] <wikibugs>	 (CR) AOkoth: [C: +1] vrts1001: Switch to insetup::serviceops_collab [puppet] - https://gerrit.wikimedia.org/r/919856 (owner: Alexandros Kosiaris)
[13:24:38] <wikibugs>	 (PS1) Ssingh: Revert "depool codfw for row D switch upgrade" [dns] - https://gerrit.wikimedia.org/r/920233
[13:24:51] <wikibugs>	 SRE, Infrastructure-Foundations, netops: eqiad/codfw virtual-chassis upgrades - https://phabricator.wikimedia.org/T327248 (ayounsi)
[13:25:23] <moritzm>	 !log enabled Puppet in codfw/esams/ulsfo for switch maintenance T335042
[13:25:24] <icinga-wm>	 RECOVERY - Bird Internet Routing Daemon on durum2002 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[13:25:24] <icinga-wm>	 RECOVERY - Bird Internet Routing Daemon on dns2002 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[13:25:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:25:27] <stashbot>	 T335042: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042
[13:25:29] <logmsgbot>	 !log taavi@deploy1002 mazevedo and taavi: Backport for [[gerrit:919372|Add stream config for mobile apps schema (T336508)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet
[13:25:30] <icinga-wm>	 RECOVERY - BFD status on cr2-codfw is OK: UP: 18 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:25:35] <taavi>	 mazevedo: please test!
[13:25:49] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Upgrade network devices to Junos 20+ - https://phabricator.wikimedia.org/T316539 (ayounsi)
[13:25:55] <wikibugs>	 SRE, Infrastructure-Foundations, netops: eqiad/codfw virtual-chassis upgrades - https://phabricator.wikimedia.org/T327248 (ayounsi) Open→Resolved a:ayounsi All stacks have been upgraded. Hopefully for the last time!
[13:26:03] <wikibugs>	 SRE, Observability-Metrics, serviceops, User-fgiunchedi: Upgrade cadvisor to 0.44 fleetwide - https://phabricator.wikimedia.org/T336740 (fgiunchedi) Open→Resolved a:fgiunchedi This is completed, we're running cadvisor `0.44.0+ds1-1~wmf1` on buster and bullseye
[13:26:05] <wikibugs>	 SRE, Observability-Metrics, Patch-For-Review, User-fgiunchedi: Collect per-cgroup cpu/mem and other system level metrics - https://phabricator.wikimedia.org/T108027 (fgiunchedi)
[13:26:11] <mazevedo>	 taavi it's working, thanks!
[13:26:14] <icinga-wm>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 109, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:26:19] <logmsgbot>	 !log jmm@puppetmaster1001 conftool action : set/pooled=yes; selector: name=ldap-replica2006.wikimedia.org
[13:26:28] <icinga-wm>	 RECOVERY - BFD status on cr1-codfw is OK: UP: 22 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:26:46] <taavi>	 ok, syncing
[13:28:03] <wikibugs>	 SRE, DBA, Data-Platform-SRE, Infrastructure-Foundations, and 9 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (MoritzMuehlenhoff)
[13:28:30] <wikibugs>	 SRE, User-MoritzMuehlenhoff: planned upstream deprecation of the ssh-rsa signing algorithm (RSA with SHA-1) - https://phabricator.wikimedia.org/T253824 (ayounsi)
[13:28:48] <wikibugs>	 (CR) JMeybohm: [C: +2] Update charts from mesh.configuration 1.1 to 1.2 [deployment-charts] - https://gerrit.wikimedia.org/r/919849 (https://phabricator.wikimedia.org/T300324) (owner: JMeybohm)
[13:28:50] <wikibugs>	 SRE, Infrastructure-Foundations, netops: all network devices must run OpenSSH >= 7.2p1 but != 7.4p1 - https://phabricator.wikimedia.org/T254013 (ayounsi) Stalled→Resolved a:ayounsi Done with all the sub-tasks upgrades.
[13:29:08] <wikibugs>	 (PS6) JMeybohm: envoyproxy: Migrate from access_log_path field to FileAccessLog API [puppet] - https://gerrit.wikimedia.org/r/769807 (https://phabricator.wikimedia.org/T303231) (owner: RLazarus)
[13:29:39] <wikibugs>	 (CR) Ssingh: [C: +2] Revert "depool codfw for row D switch upgrade" [dns] - https://gerrit.wikimedia.org/r/920233 (owner: Ssingh)
[13:30:22] <sukhe>	 !log running authdns-update to repool codfw
[13:30:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:32:36] <Emperor>	 !log repool thanos-fe2003 T335042
[13:32:38] <logmsgbot>	 !log taavi@deploy1002 Finished scap: Backport for [[gerrit:919372|Add stream config for mobile apps schema (T336508)]] (duration: 09m 08s)
[13:32:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:32:40] <stashbot>	 T335042: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042
[13:32:44] <stashbot>	 T336508: Add MobileWikiAppiOSNavigationEvents to MEP - https://phabricator.wikimedia.org/T336508
[13:32:45] <taavi>	 mazevedo: all done
[13:32:49] <wikibugs>	 (CR) Ssingh: [C: +2] Revert "hiera: temporarily remove dns2002 from authdns_servers" [puppet] - https://gerrit.wikimedia.org/r/920232 (owner: Ssingh)
[13:33:11] <logmsgbot>	 !log mvernon@cumin1001 conftool action : set/pooled=no; selector: name=thanos-fe2003.codfwm.wmnet,service=thanos-web
[13:33:33] <logmsgbot>	 !log mvernon@cumin1001 conftool action : set/pooled=no; selector: name=thanos-fe2003.codfw.wmnet,service=thanos-web
[13:34:16] <wikibugs>	 (PS3) Ssingh: pybal/lvs: remove backward compatibility for buster [puppet] - https://gerrit.wikimedia.org/r/910566 (https://phabricator.wikimedia.org/T321309)
[13:34:57] <wikibugs>	 (Merged) jenkins-bot: Update charts from mesh.configuration 1.1 to 1.2 [deployment-charts] - https://gerrit.wikimedia.org/r/919849 (https://phabricator.wikimedia.org/T300324) (owner: JMeybohm)
[13:37:02] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s4 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 618.95 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[13:37:03] <wikibugs>	 (CR) D3r1ck01: "I thought Subbu already created a revert: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Math/+/919309?" [extensions/Math] (wmf/1.41.0-wmf.9) - https://gerrit.wikimedia.org/r/920230 (https://phabricator.wikimedia.org/T335347) (owner: Daniel Kinzler)
[13:37:48] <wikibugs>	 (PS2) JMeybohm: envoy: Move upstream HTTP config into the new HttpProtocolOptions message [puppet] - https://gerrit.wikimedia.org/r/916498 (https://phabricator.wikimedia.org/T303230)
[13:38:18] <wikibugs>	 (CR) JMeybohm: [C: +2] envoyproxy: Migrate from access_log_path field to FileAccessLog API [puppet] - https://gerrit.wikimedia.org/r/769807 (https://phabricator.wikimedia.org/T303231) (owner: RLazarus)
[13:38:55] <wikibugs>	 (CR) Ssingh: [V: +1] "PCC SUCCESS (DIFF 7): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41201/console" [puppet] - https://gerrit.wikimedia.org/r/910566 (https://phabricator.wikimedia.org/T321309) (owner: Ssingh)
[13:39:07] <wikibugs>	 (CR) DCausse: search: Add alert based on age of titlesuggest indices (2 comments) [alerts] - https://gerrit.wikimedia.org/r/911945 (https://phabricator.wikimedia.org/T327199) (owner: Ebernhardson)
[13:39:25] <logmsgbot>	 !log btullis@puppetmaster1001 conftool action : set/pooled=yes; selector: cluster=eventschemas,dc=codfw,name=schema2004.eqiad.wmnet
[13:39:45] <logmsgbot>	 !log btullis@puppetmaster1001 conftool action : set/pooled=yes; selector: cluster=eventschemas,dc=codfw,name=schema2004.codfw.wmnet
[13:42:40] <icinga-wm>	 RECOVERY - configured eth on lvs2011 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[13:44:48] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s4 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 356.31 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[13:45:29] <_joe_>	 vgutierrez: going to restart pybals in eqiad, FYI
[13:45:36] <logmsgbot>	 !log oblivian@cumin1001 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on A:lvs-eqiad
[13:46:04] <wikibugs>	 (PS1) Volans: users: change my own SSH key to test ed25519 [homer/public] - https://gerrit.wikimedia.org/r/920279
[13:46:05] <Emperor>	 !log repool ms-fe2012 T335042
[13:46:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:46:10] <stashbot>	 T335042: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042
[13:46:10] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host cloudswift1001.eqiad.wmnet with OS bullseye
[13:46:30] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host cloudswift1001.eqiad.wmnet with...
[13:46:42] <wikibugs>	 (CR) Ayounsi: [C: +1] users: change my own SSH key to test ed25519 [homer/public] - https://gerrit.wikimedia.org/r/920279 (owner: Volans)
[13:47:10] <icinga-wm>	 PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /robots.txt (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid
[13:48:40] <wikibugs>	 (CR) Muehlenhoff: sre.ganeti.makevm call reimage after VM creation (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/920203 (https://phabricator.wikimedia.org/T336491) (owner: Slyngshede)
[13:48:44] <icinga-wm>	 RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[13:49:32] <logmsgbot>	 !log oblivian@cumin1001 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on A:lvs-eqiad
[13:51:20] <wikibugs>	 (PS1) Ssingh: Revert "Revert "dns2005: add Puppet role and DNS/NTP configs"" [puppet] - https://gerrit.wikimedia.org/r/920234
[13:52:55] <wikibugs>	 (CR) Ssingh: [C: +2] Revert "Revert "dns2005: add Puppet role and DNS/NTP configs"" [puppet] - https://gerrit.wikimedia.org/r/920234 (owner: Ssingh)
[13:53:39] <wikibugs>	 (CR) Effie Mouzeli: [V: +2 C: +2] php-multiversion-base: add librsvg2-bin [docker-images/production-images] - https://gerrit.wikimedia.org/r/920257 (https://phabricator.wikimedia.org/T336025) (owner: Effie Mouzeli)
[13:53:45] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host dns2005.wikimedia.org with OS bullseye
[13:53:59] <wikibugs>	 SRE, ops-codfw, DC-Ops, Traffic, Patch-For-Review: Q4:rack/setup/install dns200[456] - https://phabricator.wikimedia.org/T326688 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host dns2005.wikimedia.org with OS bullseye
[13:54:21] <logmsgbot>	 !log akosiaris@cumin1001 START - Cookbook sre.discovery.datacenter pool all active/active services in codfw: codfw row D switches upgrade done - T335042
[13:54:24] <stashbot>	 T335042: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042
[13:54:36] <wikibugs>	 SRE, DBA, Data-Platform-SRE, Infrastructure-Foundations, and 9 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (ops-monitoring-bot) akosiaris@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter pool all active/active services in codfw: codfw row D switches upg...
[13:55:17] <wikibugs>	 (PS5) Herron: role::webperf::profiling_tools: add redis instance for arclamp [puppet] - https://gerrit.wikimedia.org/r/919164 (https://phabricator.wikimedia.org/T327277)
[13:57:03] <wikibugs>	 (PS1) Btullis: Add the refinery-cache directory to the git safe list [puppet] - https://gerrit.wikimedia.org/r/920280 (https://phabricator.wikimedia.org/T334493)
[13:57:06] <jinxer-wm>	 (CirrusSearchJobQueueBacklogTooBig) resolved: CirrusSearch job topic eqiad.mediawiki.job.cirrusSearchLinksUpdate is heavily backlogged with 204.2k messages - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=eqiad%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueBacklogTooBig
[13:57:22] <wikibugs>	 (CR) Btullis: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/920280 (https://phabricator.wikimedia.org/T334493) (owner: Btullis)
[13:58:35] <wikibugs>	 (CR) Btullis: "The latest errors are mentioned here: https://phabricator.wikimedia.org/T334493#8855435" [puppet] - https://gerrit.wikimedia.org/r/920280 (https://phabricator.wikimedia.org/T334493) (owner: Btullis)
[13:58:52] <wikibugs>	 (PS2) Volans: users: change my own SSH key to test ed25519 [homer/public] - https://gerrit.wikimedia.org/r/920279
[13:58:54] <wikibugs>	 (PS1) Volans: login: use the key type speficied in the config [homer/public] - https://gerrit.wikimedia.org/r/920281
[13:58:56] <wikibugs>	 (PS1) AikoChou: changeprop: add liftwing outlink topic stream [deployment-charts] - https://gerrit.wikimedia.org/r/920282 (https://phabricator.wikimedia.org/T328899)
[13:59:30] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/920280 (https://phabricator.wikimedia.org/T334493) (owner: Btullis)
[13:59:32] <wikibugs>	 (PS1) Gergő Tisza: [beta] GrowthExperiments: Use ActionApiImageRecommendationApiHandler [mediawiki-config] - https://gerrit.wikimedia.org/r/920283 (https://phabricator.wikimedia.org/T335641)
[13:59:34] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +2] vrts1001: Switch to insetup::serviceops_collab [puppet] - https://gerrit.wikimedia.org/r/919856 (owner: Alexandros Kosiaris)
[13:59:38] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +2] "Thanks!" [puppet] - https://gerrit.wikimedia.org/r/919856 (owner: Alexandros Kosiaris)
[13:59:42] <wikibugs>	 (CR) Ayounsi: [C: +1] login: use the key type speficied in the config [homer/public] - https://gerrit.wikimedia.org/r/920281 (owner: Volans)
[14:00:44] <wikibugs>	 (CR) Volans: [C: +2] login: use the key type speficied in the config [homer/public] - https://gerrit.wikimedia.org/r/920281 (owner: Volans)
[14:00:49] <wikibugs>	 (CR) Volans: [C: +2] users: change my own SSH key to test ed25519 [homer/public] - https://gerrit.wikimedia.org/r/920279 (owner: Volans)
[14:00:53] <wikibugs>	 (CR) Hashar: [C: +1] gerrit: remove gerrit1001 as a source host for migrations [puppet] - https://gerrit.wikimedia.org/r/919400 (https://phabricator.wikimedia.org/T336427) (owner: Dzahn)
[14:01:19] <wikibugs>	 (Merged) jenkins-bot: login: use the key type speficied in the config [homer/public] - https://gerrit.wikimedia.org/r/920281 (owner: Volans)
[14:01:22] <wikibugs>	 (Merged) jenkins-bot: users: change my own SSH key to test ed25519 [homer/public] - https://gerrit.wikimedia.org/r/920279 (owner: Volans)
[14:02:20] <wikibugs>	 (PS2) David Caro: Revert "Revert "toolforge_cli: add api gateway url and builds endpoint"" [puppet] - https://gerrit.wikimedia.org/r/918544
[14:02:22] <wikibugs>	 (CR) Btullis: "PCC doesn't like it :-(" [puppet] - https://gerrit.wikimedia.org/r/920280 (https://phabricator.wikimedia.org/T334493) (owner: Btullis)
[14:02:35] <wikibugs>	 (PS3) David Caro: toolforge_cli: add api gateway url and builds endpoint [puppet] - https://gerrit.wikimedia.org/r/918544
[14:05:40] <wikibugs>	 (CR) Muehlenhoff: [C: +1] Add the refinery-cache directory to the git safe list (1 comment) [puppet] - https://gerrit.wikimedia.org/r/920280 (https://phabricator.wikimedia.org/T334493) (owner: Btullis)
[14:05:44] <wikibugs>	 (PS2) Btullis: Add the refinery-cache directory to the git safe list [puppet] - https://gerrit.wikimedia.org/r/920280 (https://phabricator.wikimedia.org/T334493)
[14:05:48] <icinga-wm>	 RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:06:06] <wikibugs>	 (CR) CI reject: [V: -1] Add the refinery-cache directory to the git safe list [puppet] - https://gerrit.wikimedia.org/r/920280 (https://phabricator.wikimedia.org/T334493) (owner: Btullis)
[14:06:20] <icinga-wm>	 RECOVERY - Check unit status of httpbb_hourly_appserver on cumin2002 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[14:06:32] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:06:36] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on dns2005.wikimedia.org with reason: host reimage
[14:06:51] <wikibugs>	 (CR) Daniel Kinzler: Revert "Revert "Add getMultiHttpClient function to make HTTP requests to Mathoid."" (1 comment) [extensions/Math] (wmf/1.41.0-wmf.9) - https://gerrit.wikimedia.org/r/920230 (https://phabricator.wikimedia.org/T335347) (owner: Daniel Kinzler)
[14:07:28] <wikibugs>	 (PS1) Effie Mouzeli: minor fix [docker-images/production-images] - https://gerrit.wikimedia.org/r/920290
[14:08:32] <wikibugs>	 (PS3) Btullis: Add the refinery-cache directory to the git safe list [puppet] - https://gerrit.wikimedia.org/r/920280 (https://phabricator.wikimedia.org/T334493)
[14:08:55] <wikibugs>	 (CR) D3r1ck01: Revert "Revert "Add getMultiHttpClient function to make HTTP requests to Mathoid."" (1 comment) [extensions/Math] (wmf/1.41.0-wmf.9) - https://gerrit.wikimedia.org/r/920230 (https://phabricator.wikimedia.org/T335347) (owner: Daniel Kinzler)
[14:10:02] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dns2005.wikimedia.org with reason: host reimage
[14:10:25] <wikibugs>	 10SRE, 10DBA, 10Data-Platform-SRE, 10Infrastructure-Foundations, and 9 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (10ops-monitoring-bot) akosiaris@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter pool all active/active services in codfw: codfw row D switches upg...
[14:10:45] <logmsgbot>	 !log akosiaris@cumin1001 END (FAIL) - Cookbook sre.discovery.datacenter (exit_code=93) pool all active/active services in codfw: codfw row D switches upgrade done - T335042
[14:10:49] <stashbot>	 T335042: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042
[14:11:13] <wikibugs>	 (03CR) 10Effie Mouzeli: [V: 03+2 C: 03+2] minor fix [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/920290 (owner: 10Effie Mouzeli)
[14:11:19] <wikibugs>	 (03CR) 10Btullis: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/920280 (https://phabricator.wikimedia.org/T334493) (owner: 10Btullis)
[14:11:32] <jinxer-wm>	 (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:11:43] <wikibugs>	 (03CR) 10Andrew Bogott: "This needs further clarification as we can no longer distinguish between the VM range and the new range that will include cloudcontrols.  " [puppet] - 10https://gerrit.wikimedia.org/r/919292 (https://phabricator.wikimedia.org/T324992) (owner: 10Arturo Borrero Gonzalez)
[14:11:59] <wikibugs>	 (03PS1) 10Cathal Mooney: Add a new aggregate network for the cloud-private 'supernet' [puppet] - 10https://gerrit.wikimedia.org/r/920291 (https://phabricator.wikimedia.org/T324992)
[14:14:20] <wikibugs>	 (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/920280 (https://phabricator.wikimedia.org/T334493) (owner: 10Btullis)
[14:15:09] <wikibugs>	 (03PS2) 10Cathal Mooney: Add a new aggregate network for the cloud-private 'supernet' [puppet] - 10https://gerrit.wikimedia.org/r/920291 (https://phabricator.wikimedia.org/T324992)
[14:16:44] <wikibugs>	 (03PS1) 10Ssingh: config/common.yaml: update SSH key for sukhe (switch to ed25519) [homer/public] - 10https://gerrit.wikimedia.org/r/920292 (https://phabricator.wikimedia.org/T336769)
[14:17:40] <logmsgbot>	 !log jnuche@deploy1002 Started deploy [releng/jenkins-deploy@0c82f2d] (releasing): (no justification provided)
[14:18:26] <logmsgbot>	 !log jnuche@deploy1002 Finished deploy [releng/jenkins-deploy@0c82f2d] (releasing): (no justification provided) (duration: 00m 45s)
[14:18:57] <wikibugs>	 (03CR) 10SBassett: [V: 03+1] "Seems right to me, though I'm, at best, a puppet novice." [puppet] - 10https://gerrit.wikimedia.org/r/920194 (https://phabricator.wikimedia.org/T305114) (owner: 10Marostegui)
[14:19:54] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] "Merging based on previous vote." [puppet] - 10https://gerrit.wikimedia.org/r/920280 (https://phabricator.wikimedia.org/T334493) (owner: 10Btullis)
[14:19:59] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s4 on clouddb1019 is OK: OK slave_sql_lag Replication lag: 51.94 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[14:20:37] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] config/common.yaml: update SSH key for sukhe (switch to ed25519) [homer/public] - 10https://gerrit.wikimedia.org/r/920292 (https://phabricator.wikimedia.org/T336769) (owner: 10Ssingh)
[14:20:50] <wikibugs>	 (03PS3) 10Bking: rdf-streaming-updater@staging: upgrade to flink 1.16.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/911861 (https://phabricator.wikimedia.org/T334244) (owner: 10DCausse)
[14:21:24] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] config/common.yaml: update SSH key for sukhe (switch to ed25519) [homer/public] - 10https://gerrit.wikimedia.org/r/920292 (https://phabricator.wikimedia.org/T336769) (owner: 10Ssingh)
[14:22:01] <wikibugs>	 (03Merged) 10jenkins-bot: config/common.yaml: update SSH key for sukhe (switch to ed25519) [homer/public] - 10https://gerrit.wikimedia.org/r/920292 (https://phabricator.wikimedia.org/T336769) (owner: 10Ssingh)
[14:24:45] <hashar>	 jouncebot: now
[14:24:45] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 35 minute(s)
[14:24:49] <wikibugs>	 10SRE, 10DBA, 10Data-Platform-SRE, 10Infrastructure-Foundations, and 9 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (10herron)
[14:25:00] <wikibugs>	 (03CR) 10Bking: [C: 03+2] rdf-streaming-updater@staging: upgrade to flink 1.16.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/911861 (https://phabricator.wikimedia.org/T334244) (owner: 10DCausse)
[14:25:24] <wikibugs>	 10SRE, 10WMF-General-or-Unknown, 10NewFunctionality-Worktype, 10SecTeam-Processed: security@mediawiki.org : Create a public key and publish it on the public key servers - https://phabricator.wikimedia.org/T40860 (10sbassett) 05Open→03Declined I think the current incarnation of the #security-team would...
[14:25:45] <wikibugs>	 (03Merged) 10jenkins-bot: rdf-streaming-updater@staging: upgrade to flink 1.16.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/911861 (https://phabricator.wikimedia.org/T334244) (owner: 10DCausse)
[14:26:38] <wikibugs>	 10SRE, 10DBA, 10Data-Platform-SRE, 10Infrastructure-Foundations, and 9 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (10ayounsi) 05Open→03Resolved a:03ayounsi Upgrade went very well. Thanks everybody! That was the last one!
[14:26:43] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dns2005.wikimedia.org with OS bullseye
[14:26:49] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: apply
[14:26:50] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: eqiad/codfw virtual-chassis upgrades - https://phabricator.wikimedia.org/T327248 (10ayounsi)
[14:26:51] <logmsgbot>	 !log bking@deploy1002 helmfile [eqiad] START helmfile.d/services/rdf-streaming-updater: apply
[14:26:56] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: apply
[14:27:02] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install dns200[456] - https://phabricator.wikimedia.org/T326688 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host dns2005.wikimedia.org with OS bullseye completed: - dns2005 (**PASS**)...
[14:27:13] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/thumbor: apply
[14:27:36] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply
[14:29:23] <wikibugs>	 10SRE, 10All-and-every-Wikisource, 10Search-Console-access-request: Search Console access request for Wikisource (Volunteer) - https://phabricator.wikimedia.org/T336255 (10Albertoleoncio)
[14:30:11] <wikibugs>	 (03PS15) 10Ottomata: New wikikube service: mediawiki-page-content-change-enrichment - staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/895241 (https://phabricator.wikimedia.org/T325303)
[14:30:18] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/thumbor: sync
[14:30:26] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/thumbor: sync
[14:31:29] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'.
[14:31:47] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[14:32:07] <wikibugs>	 (03CR) 10Herron: [C: 03+2] role::webperf::profiling_tools: add redis instance for arclamp [puppet] - 10https://gerrit.wikimedia.org/r/919164 (https://phabricator.wikimedia.org/T327277) (owner: 10Herron)
[14:32:24] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/thumbor: apply
[14:32:28] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply
[14:32:33] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/thumbor: apply
[14:33:41] <wikibugs>	 (03PS1) 10JMeybohm: users: Update my SSH key to a ed25519 one [homer/public] - 10https://gerrit.wikimedia.org/r/920295 (https://phabricator.wikimedia.org/T336769)
[14:35:18] <wikibugs>	 (03PS1) 10Guergana Tzatchkova: Enable wmgWikibaseTmpWbsubscribersSensibleOutput on wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/920296 (https://phabricator.wikimedia.org/T336760)
[14:36:17] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: trafficserver: allow partial traffic flow to mw on k8s (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/917841 (https://phabricator.wikimedia.org/T336038) (owner: 10Giuseppe Lavagetto)
[14:36:26] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] trafficserver: make mw-on-k8s use a config file [puppet] - 10https://gerrit.wikimedia.org/r/917840 (https://phabricator.wikimedia.org/T336037) (owner: 10Giuseppe Lavagetto)
[14:36:50] <logmsgbot>	 !log bking@deploy1002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply
[14:37:13] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] trafficserver: allow partial traffic flow to mw on k8s [puppet] - 10https://gerrit.wikimedia.org/r/917841 (https://phabricator.wikimedia.org/T336038) (owner: 10Giuseppe Lavagetto)
[14:37:23] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: trafficserver: allow partial traffic flow to mw on k8s [puppet] - 10https://gerrit.wikimedia.org/r/917841 (https://phabricator.wikimedia.org/T336038)
[14:38:44] <_joe_>	 sigh come on jenkinsss
[14:39:04] <wikibugs>	 (03PS2) 10Ssingh: sites.yaml: add new dns host dns2005 (codfw hardware refresh) [homer/public] - 10https://gerrit.wikimedia.org/r/919876 (https://phabricator.wikimedia.org/T326688)
[14:39:58] <wikibugs>	 (03PS1) 10Herron: arclamp: switch redis server to arclamp1001 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/920298 (https://phabricator.wikimedia.org/T327277)
[14:42:03] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] sites.yaml: add new dns host dns2005 (codfw hardware refresh) [homer/public] - 10https://gerrit.wikimedia.org/r/919876 (https://phabricator.wikimedia.org/T326688) (owner: 10Ssingh)
[14:42:10] <wikibugs>	 (03PS1) 10Herron: arclamp: switch redis server to arclamp1001 [puppet] - 10https://gerrit.wikimedia.org/r/920299 (https://phabricator.wikimedia.org/T327277)
[14:42:27] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] conftool-data: add discovery config for the k8s-ingress-ml-serve [puppet] - 10https://gerrit.wikimedia.org/r/920215 (owner: 10Elukey)
[14:42:36] <sukhe>	 !log "cr*-codfw*" commit "Gerrit: 919876 add new DNS host dns2005": T326688
[14:42:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:42:41] <stashbot>	 T326688: Q4:rack/setup/install dns200[456] - https://phabricator.wikimedia.org/T326688
[14:42:46] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply
[14:43:20] <hashar>	 !log Restarting CI Jenkins
[14:43:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:46:28] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/thumbor: apply
[14:47:35] <logmsgbot>	 !log bking@deploy1002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply
[14:48:44] <sukhe>	 !log [done] "cr*-codfw*" commit "Gerrit: 919876 add new DNS host dns2005": T326688
[14:48:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:48:48] <stashbot>	 T326688: Q4:rack/setup/install dns200[456] - https://phabricator.wikimedia.org/T326688
[14:49:08] <moritzm>	 !log installing libxml2 security updates on buster
[14:49:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:49:31] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s4 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 574.99 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[14:51:49] <wikibugs>	 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T336538 (10Jhancock.wm) power cord in PSU1 was replaced and secured. alert has cleared
[14:51:51] <wikibugs>	 10SRE, 10All-and-every-Wikisource, 10Search-Console-access-request: Search Console access request for Wikisource (Volunteer) - https://phabricator.wikimedia.org/T336255 (10RhinosF1) @Samwilson: assuming this is following https://wikitech.wikimedia.org/wiki/Volunteer_NDA, please get your manager to comment an...
[14:51:54] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: trafficserver: actually carry over the config file [puppet] - 10https://gerrit.wikimedia.org/r/920300
[14:52:47] <wikibugs>	 (03PS1) 10Jgreen: users: change my own SSH key to ed25519 [homer/public] - 10https://gerrit.wikimedia.org/r/920301 (https://phabricator.wikimedia.org/T336769)
[14:53:29] <wikibugs>	 (03PS1) 10Ssingh: hiera: add new DNS host dns2005 [puppet] - 10https://gerrit.wikimedia.org/r/920302 (https://phabricator.wikimedia.org/T326688)
[14:55:02] <logmsgbot>	 !log bking@deploy1002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply
[14:55:20] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] trafficserver: actually carry over the config file [puppet] - 10https://gerrit.wikimedia.org/r/920300 (owner: 10Giuseppe Lavagetto)
[14:56:20] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] hiera: add new DNS host dns2005 [puppet] - 10https://gerrit.wikimedia.org/r/920302 (https://phabricator.wikimedia.org/T326688) (owner: 10Ssingh)
[14:57:02] <wikibugs>	 (03PS1) 10Jgreen: Change my own SSH key to ed25519 [puppet] - 10https://gerrit.wikimedia.org/r/920303
[14:58:16] <wikibugs>	 (03PS2) 10Jgreen: Change my own SSH key to ed25519 [puppet] - 10https://gerrit.wikimedia.org/r/920303
[14:58:32] <wikibugs>	 (03CR) 10Pmiazga: [C: 03+1] rest-gateway: don't append when setting headers [deployment-charts] - 10https://gerrit.wikimedia.org/r/917340 (https://phabricator.wikimedia.org/T329074) (owner: 10Hnowlan)
[14:58:38] <wikibugs>	 (03PS1) 10Bking: rdf-streaming-updater: use correct image path [deployment-charts] - 10https://gerrit.wikimedia.org/r/920304 (https://phabricator.wikimedia.org/T334244)
[14:59:37] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudswift1001.eqiad.wmnet with OS bullseye
[14:59:47] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host cloudswift1001.eqiad.wmnet with OS...
[15:00:10] <wikibugs>	 (03CR) 10DCausse: [C: 03+1] rdf-streaming-updater: use correct image path [deployment-charts] - 10https://gerrit.wikimedia.org/r/920304 (https://phabricator.wikimedia.org/T334244) (owner: 10Bking)
[15:01:25] <wikibugs>	 (03PS1) 10Guergana Tzatchkova: Enable wmgWikibaseTmpEnableLabelsInApiSummaries on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/920306 (https://phabricator.wikimedia.org/T335099)
[15:03:52] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply
[15:04:43] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/thumbor: apply
[15:07:19] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply
[15:16:54] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s4 on clouddb1019 is OK: OK slave_sql_lag Replication lag: 0.46 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[15:17:38] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH nodes) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:18:43] <hashar>	 !log CI Jenkins jobs are stalled following the plugins upgrade :/
[15:18:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:22:38] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH nodes) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:26:14] <jinxer-wm>	 (SystemdUnitFailed) firing: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-categories.service Failed on wdqs1010:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:26:33] <Emperor>	 !log rebalance codfw swift rings T335280
[15:26:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:26:37] <stashbot>	 T335280: Drain and then decommission ms-be20[40-43] - https://phabricator.wikimedia.org/T335280
[15:27:52] <hashar>	 !log Restarting CI Jenkins
[15:27:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:32:08] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/rest-gateway: apply
[15:32:16] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
[15:33:18] <sukhe>	 !log set routing-options static route 208.80.153.231/32 next-hop [ 208.80.153.10 208.80.153.48 208.80.153.74 ]
[15:33:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:33:32] <sukhe>	 !log set routing-options static route 208.80.153.231/32 next-hop [ 208.80.153.10 208.80.153.48 208.80.153.74 ]: T326688
[15:33:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:33:35] <stashbot>	 T326688: Q4:rack/setup/install dns200[456] - https://phabricator.wikimedia.org/T326688
[15:36:00] <hashar>	 !log Some CI jobs started failing after an upgrade of some Jenkins plugins. I have upgraded a couple more and it seems to work now T336775
[15:36:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:36:05] <stashbot>	 T336775: Jenkins CI job castor-save-workspace-cache stall breaking the whole CI - https://phabricator.wikimedia.org/T336775
[15:41:44] <logmsgbot>	 !log joal@deploy1002 Started deploy [airflow-dags/analytics@7fa2dcd]: Regular analytics weekly train [airflow-dags@7fa2dcd]
[15:41:54] <logmsgbot>	 !log joal@deploy1002 Finished deploy [airflow-dags/analytics@7fa2dcd]: Regular analytics weekly train [airflow-dags@7fa2dcd] (duration: 00m 10s)
[15:49:53] <sukhe>	 !log run authdns-update for CR 920314
[15:49:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:00:05] <jouncebot>	 jbond and rzl: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230516T1600).
[16:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[16:06:43] <mutante>	 !log gitlab-runner2003 - installed rsync client for debugging an issue with rsync from inside containers, comparing to from outside container
[16:06:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:13:38] <sukhe>	 is it me or is wikibugs not active
[16:14:26] <rzl>	 I feel like it's been flaky the last few days
[16:15:00] <sukhe>	 I have a mental habit of using it to keep track of the order of changes
[16:15:01] <jinxer-wm>	 (NodeTextfileStale) firing: (2) Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[16:24:48] <sukhe>	 would someone with the right permissions please restart wikibugs? thank you :)
[16:30:01] <jinxer-wm>	 (NodeTextfileStale) firing: (3) Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[16:30:01] <volans>	 !log restarting wikibugs ( https://www.mediawiki.org/wiki/Wikibugs#Help )
[16:30:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:30:13] <logmsgbot>	 !log aikochou@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' .
[16:30:18] <wikibugs_>	 (03PS1) 10Arturo Borrero Gonzalez: openstack: keystone: be more specific about password ACL [puppet] - 10https://gerrit.wikimedia.org/r/920325 (https://phabricator.wikimedia.org/T336723)
[16:30:23] <volans>	 sukhe: ^^
[16:30:55] <sukhe>	 thank you, I couldn't login to toolforge for some reason
[16:31:03] <sukhe>	 I did know about the link
[16:31:33] <volans>	 couple of old pods are taking long time to be terminated
[16:31:39] <volans>	 keeping an eye on them for now
[16:32:14] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] openstack: keystone: be more specific about password ACL [puppet] - 10https://gerrit.wikimedia.org/r/920325 (https://phabricator.wikimedia.org/T336723) (owner: 10Arturo Borrero Gonzalez)
[16:32:50] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 04-1] "not interested in merging this, it was just a PoC" [puppet] - 10https://gerrit.wikimedia.org/r/920325 (https://phabricator.wikimedia.org/T336723) (owner: 10Arturo Borrero Gonzalez)
[16:33:57] <wikibugs>	 (03CR) 10Pmiazga: [C: 03+1] rest-gateway: don't append when setting headers (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/917340 (https://phabricator.wikimedia.org/T329074) (owner: 10Hnowlan)
[16:36:51] <wikibugs>	 (03PS4) 10EoghanGaffney: Change doc hosts to use rsync::quickdatacopy [puppet] - 10https://gerrit.wikimedia.org/r/920310 (https://phabricator.wikimedia.org/T333945)
[16:37:25] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] rest-gateway: don't append when setting headers (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/917340 (https://phabricator.wikimedia.org/T329074) (owner: 10Hnowlan)
[16:40:06] <icinga-wm>	 PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[16:41:40] <icinga-wm>	 RECOVERY - BFD status on cr2-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[16:41:53] <wikibugs>	 (03CR) 10EoghanGaffney: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41204/console" [puppet] - 10https://gerrit.wikimedia.org/r/920310 (https://phabricator.wikimedia.org/T333945) (owner: 10EoghanGaffney)
[16:43:57] <wikibugs>	 10SRE-swift-storage, 10Data-Engineering-Planning, 10Event-Platform Value Stream (Sprint 14 A): Storage request: swift s3 bucket for mediawiki-page-content-change-enrichment checkpointing - https://phabricator.wikimedia.org/T330693 (10Ottomata)
[16:44:07] <wikibugs>	 10SRE-swift-storage, 10Data-Engineering-Planning, 10Event-Platform Value Stream (Sprint 14 A): Storage request: swift s3 bucket for mediawiki-page-content-change-enrichment checkpointing - https://phabricator.wikimedia.org/T330693 (10Ottomata) Thanks @hnowlan took me a bit to find this, but I did and we adde...
[16:44:37] <wikibugs>	 (03PS3) 10Arturo Borrero Gonzalez: Add a new aggregate network for the cloud-private 'supernet' [puppet] - 10https://gerrit.wikimedia.org/r/920291 (https://phabricator.wikimedia.org/T324992) (owner: 10Cathal Mooney)
[16:50:33] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: keystone: service: allow cloud-private supernet [puppet] - 10https://gerrit.wikimedia.org/r/920348 (https://phabricator.wikimedia.org/T336723)
[16:53:59] <wikibugs>	 (03PS1) 10Volans: sre.network.provision: bugfix and improvements [cookbooks] - 10https://gerrit.wikimedia.org/r/920349 (https://phabricator.wikimedia.org/T336485)
[16:55:12] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: service=thumbor,name=thumbor100[1256].eqiad.wmnet
[16:56:07] <wikibugs>	 (03PS1) 10Ssingh: hiera: decommission dns2002 [puppet] - 10https://gerrit.wikimedia.org/r/920350 (https://phabricator.wikimedia.org/T335777)
[16:57:31] <wikibugs>	 (03PS1) 10Dwisehaupt: config/common.yaml: update SSH key for dwisehaupt [homer/public] - 10https://gerrit.wikimedia.org/r/920351 (https://phabricator.wikimedia.org/T336769)
[16:58:08] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] sites.yaml: remove dns2002 from anycast_neighbors (host decom) [homer/public] - 10https://gerrit.wikimedia.org/r/920320 (https://phabricator.wikimedia.org/T335777) (owner: 10Ssingh)
[16:59:49] <moritzm>	 !log installing 5.10.179 kernels on Bullseye hosts
[16:59:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:00:02] <wikibugs>	 (03PS2) 10Dwisehaupt: users: update SSH key for dwisehaupt [homer/public] - 10https://gerrit.wikimedia.org/r/920351 (https://phabricator.wikimedia.org/T336769)
[17:00:06] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230516T1700)
[17:00:12] <sukhe>	 !log homer "cr*-codfw*" commit "Gerrit: 920320 remove to-be decommissioned host dns2002" T335777
[17:00:14] <wikibugs>	 (03CR) 10Cathal Mooney: "lgtm, just one small typo I think" [cookbooks] - 10https://gerrit.wikimedia.org/r/920349 (https://phabricator.wikimedia.org/T336485) (owner: 10Volans)
[17:00:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:00:19] <stashbot>	 T335777: Q4:rack/decom codfw unified decommission task - https://phabricator.wikimedia.org/T335777
[17:01:08] <wikibugs>	 (03PS2) 10Volans: sre.network.provision: bugfix and improvements [cookbooks] - 10https://gerrit.wikimedia.org/r/920349 (https://phabricator.wikimedia.org/T336485)
[17:01:17] <wikibugs>	 (03CR) 10Volans: "good catch, thx, addressed comment" [cookbooks] - 10https://gerrit.wikimedia.org/r/920349 (https://phabricator.wikimedia.org/T336485) (owner: 10Volans)
[17:02:44] <wikibugs>	 (03PS2) 10Ssingh: hiera: decommission dns2002 [puppet] - 10https://gerrit.wikimedia.org/r/920350 (https://phabricator.wikimedia.org/T335777)
[17:03:08] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+1] "LGTM!" [cookbooks] - 10https://gerrit.wikimedia.org/r/920349 (https://phabricator.wikimedia.org/T336485) (owner: 10Volans)
[17:03:56] <wikibugs>	 (03PS2) 10AikoChou: changeprop: add liftwing outlink topic stream [deployment-charts] - 10https://gerrit.wikimedia.org/r/920282 (https://phabricator.wikimedia.org/T328899)
[17:04:13] <wikibugs>	 (03CR) 10Volans: [C: 03+2] sre.network.provision: bugfix and improvements [cookbooks] - 10https://gerrit.wikimedia.org/r/920349 (https://phabricator.wikimedia.org/T336485) (owner: 10Volans)
[17:05:20] <wikibugs>	 (03CR) 10AikoChou: changeprop: add liftwing outlink topic stream (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/920282 (https://phabricator.wikimedia.org/T328899) (owner: 10AikoChou)
[17:05:26] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] hiera: decommission dns2002 [puppet] - 10https://gerrit.wikimedia.org/r/920350 (https://phabricator.wikimedia.org/T335777) (owner: 10Ssingh)
[17:06:44] <wikibugs>	 (03Merged) 10jenkins-bot: sre.network.provision: bugfix and improvements [cookbooks] - 10https://gerrit.wikimedia.org/r/920349 (https://phabricator.wikimedia.org/T336485) (owner: 10Volans)
[17:09:46] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.decommission for hosts dns2002.wikimedia.org
[17:10:52] <mbsantos>	 Hey jayme I was about to start a deployment for mobileapps service but I see you're working on mesh upgrade at https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/919849, I see it's not deployed yet. Should I hold the image bumping for mobileapps?
[17:12:13] <jayme>	 mbsantos: thanks for reaching out! Please feel free to deploy the change with your image bump, there is no change in behaviour expected (and none seen as of now ;))
[17:12:46] <mbsantos>	 thanks!
[17:14:55] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.dns.netbox
[17:16:57] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: dns2002.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin2002"
[17:17:11] <logmsgbot>	 !log volans@cumin1001 START - Cookbook sre.network.provision for device ssw1-a1-codfw.mgmt.codfw.wmnet
[17:17:12] <logmsgbot>	 !log volans@cumin1001 START - Cookbook sre.dns.netbox
[17:17:13] <wikibugs>	 10SRE, 10Content-Transform-Team-WIP, 10RESTBase, 10Traffic, and 5 others: PCS caching and pregeneration when restbase is decommissioned - https://phabricator.wikimedia.org/T319365 (10FJoseph-WMF) I've scheduled a meeting this week for followup
[17:18:05] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: dns2002.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin2002"
[17:18:05] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:18:06] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts dns2002.wikimedia.org
[17:18:15] <wikibugs>	 10SRE, 10ops-codfw, 10Traffic, 10Patch-For-Review: Q4:rack/decom codfw unified decommission task - https://phabricator.wikimedia.org/T335777 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by sukhe@cumin2002 for hosts: `dns2002.wikimedia.org` - dns2002.wikimedia.org (**WARN**)   - Downtime...
[17:19:06] <wikibugs>	 (03PS1) 10BCornwall: pybal: Switch drmrs LVS to use Maglev scheduler [puppet] - 10https://gerrit.wikimedia.org/r/920353 (https://phabricator.wikimedia.org/T263797)
[17:19:07] <logmsgbot>	 !log volans@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for ssw1-a1-codfw - volans@cumin1001"
[17:19:10] <wikibugs>	 10SRE, 10ops-codfw, 10Traffic, 10Patch-For-Review: Q4:rack/decom codfw unified decommission task - https://phabricator.wikimedia.org/T335777 (10ssingh)
[17:19:36] <wikibugs>	 (03PS1) 10MSantos: mobileapps: bump to 023-05-08-112354-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/920354
[17:19:44] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] mobileapps: bump to 023-05-08-112354-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/920354 (owner: 10MSantos)
[17:20:07] <logmsgbot>	 !log volans@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for ssw1-a1-codfw - volans@cumin1001"
[17:20:08] <logmsgbot>	 !log volans@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:20:56] <wikibugs>	 (03PS1) 10Ssingh: hiera: remove obsolete dns2001.yaml file [puppet] - 10https://gerrit.wikimedia.org/r/920355 (https://phabricator.wikimedia.org/T335777)
[17:21:24] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] hiera: remove obsolete dns2001.yaml file [puppet] - 10https://gerrit.wikimedia.org/r/920355 (https://phabricator.wikimedia.org/T335777) (owner: 10Ssingh)
[17:21:34] <wikibugs>	 (03PS2) 10MSantos: mobileapps: bump to 22023-05-08-112354-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/920354
[17:21:57] <wikibugs>	 (03CR) 10BCornwall: [V: 03+1] "PCC SUCCESS (NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41205/console" [puppet] - 10https://gerrit.wikimedia.org/r/920353 (https://phabricator.wikimedia.org/T263797) (owner: 10BCornwall)
[17:22:58] <wikibugs>	 (03CR) 10MSantos: [C: 03+2] mobileapps: bump to 22023-05-08-112354-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/920354 (owner: 10MSantos)
[17:23:53] <wikibugs>	 (03Merged) 10jenkins-bot: mobileapps: bump to 22023-05-08-112354-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/920354 (owner: 10MSantos)
[17:24:08] <wikibugs>	 (03CR) 10BCornwall: [V: 03+1] "PCC SUCCESS (NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41206/console" [puppet] - 10https://gerrit.wikimedia.org/r/920353 (https://phabricator.wikimedia.org/T263797) (owner: 10BCornwall)
[17:24:14] <logmsgbot>	 !log volans@cumin1001 START - Cookbook sre.dns.netbox
[17:24:58] <wikibugs>	 (03CR) 10Ssingh: [C: 03+1] pybal: Switch drmrs LVS to use Maglev scheduler [puppet] - 10https://gerrit.wikimedia.org/r/920353 (https://phabricator.wikimedia.org/T263797) (owner: 10BCornwall)
[17:26:20] <logmsgbot>	 !log volans@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove management record for ssw1-a1-codfw - volans@cumin1001"
[17:27:18] <brett>	 !log Rolling out maglev LVS scheduler in drmrs - T263797
[17:27:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:27:21] <stashbot>	 T263797: Switch Source Hashing ('sh') scheduling on LVS hosts to Maglev hashing ('mh') - https://phabricator.wikimedia.org/T263797
[17:27:23] <logmsgbot>	 !log volans@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove management record for ssw1-a1-codfw - volans@cumin1001"
[17:27:23] <logmsgbot>	 !log volans@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:27:23] <logmsgbot>	 !log volans@cumin1001 END (FAIL) - Cookbook sre.network.provision (exit_code=99) for device ssw1-a1-codfw.mgmt.codfw.wmnet
[17:29:13] <wikibugs>	 (03CR) 10BCornwall: [V: 03+1 C: 03+2] pybal: Switch drmrs LVS to use Maglev scheduler [puppet] - 10https://gerrit.wikimedia.org/r/920353 (https://phabricator.wikimedia.org/T263797) (owner: 10BCornwall)
[17:34:05] <logmsgbot>	 !log joal@deploy1002 Started deploy [airflow-dags/analytics@7816937]: Regular analytics weekly train - Hotfix [airflow-dags@7816937]
[17:34:16] <logmsgbot>	 !log joal@deploy1002 Finished deploy [airflow-dags/analytics@7816937]: Regular analytics weekly train - Hotfix [airflow-dags@7816937] (duration: 00m 10s)
[17:37:15] <logmsgbot>	 !log volans@cumin1001 START - Cookbook sre.network.provision for device ssw1-a8-codfw.mgmt.codfw.wmnet
[17:37:16] <logmsgbot>	 !log volans@cumin1001 START - Cookbook sre.dns.netbox
[17:39:14] <logmsgbot>	 !log volans@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for ssw1-a8-codfw - volans@cumin1001"
[17:40:14] <moritzm>	 !log installing avahi security updates on buster
[17:40:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:40:18] <logmsgbot>	 !log volans@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for ssw1-a8-codfw - volans@cumin1001"
[17:40:19] <logmsgbot>	 !log volans@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:40:47] <wikibugs>	 (03PS1) 10Bartosz Dziewoński: Add maint script to opt out active users from the new topic tool [extensions/DiscussionTools] (wmf/1.41.0-wmf.8) - 10https://gerrit.wikimedia.org/r/920237 (https://phabricator.wikimedia.org/T317375)
[17:40:54] <wikibugs>	 (03PS1) 10Ssingh: dns2006: add Puppet role and DNS/NTP configs [puppet] - 10https://gerrit.wikimedia.org/r/920357 (https://phabricator.wikimedia.org/T326688)
[17:41:03] <wikibugs>	 (03PS1) 10Bartosz Dziewoński: Add maint script to opt out active users from the new topic tool [extensions/DiscussionTools] (wmf/1.41.0-wmf.9) - 10https://gerrit.wikimedia.org/r/920238 (https://phabricator.wikimedia.org/T317375)
[17:41:41] <wikibugs>	 (03PS1) 10Ssingh: sites.yaml: add new dns host dns2006 (codfw hardware refresh) [homer/public] - 10https://gerrit.wikimedia.org/r/920358 (https://phabricator.wikimedia.org/T326688)
[17:41:42] <wikibugs>	 10SRE, 10Traffic: Add systemd-level service bindings for Wikimedia DNS - https://phabricator.wikimedia.org/T336792 (10ssingh) a:05ssingh→03None
[17:43:14] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] dns2006: add Puppet role and DNS/NTP configs [puppet] - 10https://gerrit.wikimedia.org/r/920357 (https://phabricator.wikimedia.org/T326688) (owner: 10Ssingh)
[17:44:24] <logmsgbot>	 !log volans@cumin1001 START - Cookbook sre.dns.netbox
[17:45:51] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host dns2006.wikimedia.org with OS bullseye
[17:46:02] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install dns200[456] - https://phabricator.wikimedia.org/T326688 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host dns2006.wikimedia.org with OS bullseye
[17:46:11] <logmsgbot>	 !log volans@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove management record for ssw1-a8-codfw - volans@cumin1001"
[17:47:14] <logmsgbot>	 !log volans@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove management record for ssw1-a8-codfw - volans@cumin1001"
[17:47:14] <logmsgbot>	 !log volans@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:47:14] <logmsgbot>	 !log volans@cumin1001 END (FAIL) - Cookbook sre.network.provision (exit_code=99) for device ssw1-a8-codfw.mgmt.codfw.wmnet
[17:52:38] <logmsgbot>	 !log mbsantos@deploy1002 helmfile [staging] START helmfile.d/services/mobileapps: apply
[17:52:54] <logmsgbot>	 !log mbsantos@deploy1002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply
[17:53:21] <wikibugs>	 10SRE, 10WMF-General-or-Unknown, 10NewFunctionality-Worktype, 10SecTeam-Processed: security@mediawiki.org : Create a public key and publish it on the public key servers - https://phabricator.wikimedia.org/T40860 (10sbassett) >>! In T40860#8855780, @Dzahn wrote: > Ok, well, do you want to do anything about...
[17:54:39] <wikibugs>	 10SRE, 10WMF-General-or-Unknown, 10NewFunctionality-Worktype, 10SecTeam-Processed: security@mediawiki.org : Create a public key and publish it on the public key servers - https://phabricator.wikimedia.org/T40860 (10Dzahn) Alright, thanks for the details! I just meant besides GPG now, just if we should stop...
[17:55:19] <wikibugs>	 (03Abandoned) 10Cathal Mooney: Add a new aggregate network for the cloud-private 'supernet' [puppet] - 10https://gerrit.wikimedia.org/r/920291 (https://phabricator.wikimedia.org/T324992) (owner: 10Cathal Mooney)
[17:55:43] <logmsgbot>	 !log mbsantos@deploy1002 helmfile [codfw] START helmfile.d/services/mobileapps: apply
[17:56:35] <logmsgbot>	 !log mbsantos@deploy1002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply
[17:57:48] <logmsgbot>	 !log mbsantos@deploy1002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply
[17:58:47] <logmsgbot>	 !log mbsantos@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply
[17:59:05] <wikibugs>	 10SRE, 10All-and-every-Wikisource, 10Search-Console-access-request: Search Console access request for Wikisource (Volunteer) - https://phabricator.wikimedia.org/T336255 (10Dzahn) adding @scherukuwada since they just became owner of wikisource.org (all subdomains under it) in search console last week in T336500
[18:00:04] <jouncebot>	 dancy and hashar: Your horoscope predicts another unfortunate MediaWiki train - Utc-7+Utc-0 Version deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230516T1800).
[18:01:39] <sukhe>	 !log enable puppet on A:cp-text
[18:01:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:01:47] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Search-Console-access-request: Please grant scherukuwada@ access to wikisource.org in the Search Console - https://phabricator.wikimedia.org/T336500 (10Dzahn) Hi @SCherukuwada , there is a volunteer asking for access at T336255. Wondering if you have thoughts on that?  Also at...
[18:02:17] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on dns2006.wikimedia.org with reason: host reimage
[18:04:56] <dancy>	 I am here to press the buttons
[18:05:38] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dns2006.wikimedia.org with reason: host reimage
[18:05:52] <wikibugs>	 (03PS1) 10Ssingh: hiera: add new DNS host dns2006 [puppet] - 10https://gerrit.wikimedia.org/r/920359 (https://phabricator.wikimedia.org/T326688)
[18:06:46] <wikibugs>	 10SRE, 10All-and-every-Wikisource, 10Search-Console-access-request: Search Console access request for Wikisource (Volunteer) - https://phabricator.wikimedia.org/T336255 (10Dzahn) Every subdomain is a separate site. Is this request really for ALL of wikisource or for a few languages? That changes the nature o...
[18:10:05] <dancy>	 Train is blocked due to ongoing toolforge/WMCS issues:  Can't reach https://train-blockers.toolforge.org/api.php right now.
[18:10:44] <icinga-wm>	 PROBLEM - Recursive DNS on 208.80.153.107 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS
[18:10:47] <RhinosF1>	 dancy: have you raised
[18:10:57] <RhinosF1>	 It might just need webservice restart
[18:11:01] <sukhe>	 ^ DNS message above expected
[18:13:27] <dancy>	 and we're back!
[18:13:35] <mutante>	 deployment blocked because we can't read the page with the blockers? heh
[18:13:40] <mutante>	 ok!
[18:14:00] <icinga-wm>	 RECOVERY - Recursive DNS on 208.80.153.107 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS
[18:14:02] <mutante>	 sukhe: pheew. thanks for mentioning that
[18:14:05] <RhinosF1>	 mutante: now fixed, think it's used to set which train now as well
[18:14:25] <wikibugs>	 (03PS1) 10TrainBranchBot: group0 wikis to 1.41.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/920362 (https://phabricator.wikimedia.org/T330215)
[18:14:27] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] group0 wikis to 1.41.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/920362 (https://phabricator.wikimedia.org/T330215) (owner: 10TrainBranchBot)
[18:15:20] <wikibugs>	 (03Merged) 10jenkins-bot: group0 wikis to 1.41.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/920362 (https://phabricator.wikimedia.org/T330215) (owner: 10TrainBranchBot)
[18:15:30] <mutante>	 RhinosF1: I think everything that is actually used for prod and deployment should be in production.. but yea
[18:16:45] <RhinosF1>	 mutante: pretty sure train-blockers is a quick hack from taavi
[18:16:55] <RhinosF1>	 It probably could be formalised more
[18:20:05] <dancy>	 Hmm.. docker-registry.wikimedia.org/php7.4-fpm-multiversion-base changed since train presync happened, so this will be a long deployment (~40 minutes) due to full image rebuild.
[18:20:49] <dancy>	 less than 40, actually.. but longer than 7.
[18:21:39] <wikibugs>	 10SRE, 10ops-codfw, 10DBA, 10Data-Persistence-Backup: db2184 down - https://phabricator.wikimedia.org/T335640 (10jcrespo) 05Open→03Resolved a:05jcrespo→03Jhancock.wm Thank you for your help, this is good to go.
[18:22:44] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dns2006.wikimedia.org with OS bullseye
[18:22:54] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install dns200[456] - https://phabricator.wikimedia.org/T326688 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host dns2006.wikimedia.org with OS bullseye completed: - dns2006 (**PASS**)...
[18:24:20] <wikibugs>	 (03CR) 10Gehel: "A few minor style comments, mostly just to prove that I did read the code. I don't know enough about Ceph to have an opinion on how this c" [puppet] - 10https://gerrit.wikimedia.org/r/896116 (https://phabricator.wikimedia.org/T330151) (owner: 10Nicolas Fraison)
[18:25:22] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] sites.yaml: add new dns host dns2006 (codfw hardware refresh) [homer/public] - 10https://gerrit.wikimedia.org/r/920358 (https://phabricator.wikimedia.org/T326688) (owner: 10Ssingh)
[18:25:37] <wikibugs>	 (03CR) 10Ssingh: [V: 03+2 C: 03+2] sites.yaml: add new dns host dns2006 (codfw hardware refresh) [homer/public] - 10https://gerrit.wikimedia.org/r/920358 (https://phabricator.wikimedia.org/T326688) (owner: 10Ssingh)
[18:28:37] <sukhe>	 !log homer "cr*-codfw*" commit "Gerrit: 920358 add new DNS host dns2006": T326688
[18:28:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:28:41] <stashbot>	 T326688: Q4:rack/setup/install dns200[456] - https://phabricator.wikimedia.org/T326688
[18:31:19] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] hiera: add new DNS host dns2006 [puppet] - 10https://gerrit.wikimedia.org/r/920359 (https://phabricator.wikimedia.org/T326688) (owner: 10Ssingh)
[18:34:18] <logmsgbot>	 !log dancy@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.41.0-wmf.9  refs T330215
[18:34:23] <stashbot>	 T330215: 1.41.0-wmf.9 deployment blockers - https://phabricator.wikimedia.org/T330215
[18:34:39] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] "pcc results: https://puppet-compiler.wmflabs.org/output/920348/41207/" [puppet] - 10https://gerrit.wikimedia.org/r/920348 (https://phabricator.wikimedia.org/T336723) (owner: 10Arturo Borrero Gonzalez)
[18:36:18] <sukhe>	 !log set routing-options static route 208.80.153.231/32 next-hop [ 208.80.153.48 208.80.153.74 208.80.153.107 ]: T326688
[18:36:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:36:24] <stashbot>	 T326688: Q4:rack/setup/install dns200[456] - https://phabricator.wikimedia.org/T326688
[18:39:06] <wikibugs>	 (03PS1) 10Ssingh: sites.yaml: remove dns2003 from anycast_neighbors (host decom) [homer/public] - 10https://gerrit.wikimedia.org/r/920363 (https://phabricator.wikimedia.org/T335777)
[18:41:03] <logmsgbot>	 !log volans@cumin2002 START - Cookbook sre.network.provision for device ssw1-a8-codfw.mgmt.codfw.wmnet
[18:41:05] <logmsgbot>	 !log volans@cumin2002 START - Cookbook sre.dns.netbox
[18:42:21] <wikibugs>	 10SRE, 10All-and-every-Wikisource, 10Search-Console-access-request: Search Console access request for Wikisource (Volunteer) - https://phabricator.wikimedia.org/T336255 (10Soda) >>! In T336255#8856659, @Dzahn wrote: > Every subdomain is a separate site. Is this request really for ALL of wikisource or for a f...
[18:42:56] <logmsgbot>	 !log volans@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for ssw1-a8-codfw - volans@cumin2002"
[18:42:57] <wikibugs>	 (03PS1) 10Ssingh: hiera: decommission dns2003 [puppet] - 10https://gerrit.wikimedia.org/r/920364 (https://phabricator.wikimedia.org/T335777)
[18:43:56] <logmsgbot>	 !log volans@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for ssw1-a8-codfw - volans@cumin2002"
[18:43:56] <logmsgbot>	 !log volans@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[18:43:57] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] sites.yaml: remove dns2003 from anycast_neighbors (host decom) [homer/public] - 10https://gerrit.wikimedia.org/r/920363 (https://phabricator.wikimedia.org/T335777) (owner: 10Ssingh)
[18:44:30] <wikibugs>	 (03PS2) 10Ssingh: hiera: decommission dns2003 [puppet] - 10https://gerrit.wikimedia.org/r/920364 (https://phabricator.wikimedia.org/T335777)
[18:44:32] <wikibugs>	 (03Merged) 10jenkins-bot: sites.yaml: remove dns2003 from anycast_neighbors (host decom) [homer/public] - 10https://gerrit.wikimedia.org/r/920363 (https://phabricator.wikimedia.org/T335777) (owner: 10Ssingh)
[18:46:30] <logmsgbot>	 !log volans@cumin2002 START - Cookbook sre.dns.netbox
[18:46:42] <sukhe>	 !log homer "cr*-codfw*" commit "Gerrit: 920363 remove to-be decommissioned host dns2003": T335777
[18:46:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:46:46] <stashbot>	 T335777: Q4:rack/decom codfw unified decommission task - https://phabricator.wikimedia.org/T335777
[18:46:55] <ryankemper>	 !log [WDQS] Pooled `wdqs2006` (not sure why was depooled)
[18:46:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:47:46] <ryankemper>	 !log [WDQS] Pooled `wdqs2012`
[18:47:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:48:38] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] hiera: decommission dns2003 [puppet] - 10https://gerrit.wikimedia.org/r/920364 (https://phabricator.wikimedia.org/T335777) (owner: 10Ssingh)
[18:49:06] <logmsgbot>	 !log volans@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove management record for ssw1-a8-codfw - volans@cumin2002"
[18:50:01] <jinxer-wm>	 (NodeTextfileStale) firing: (3) Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[18:50:04] <logmsgbot>	 !log volans@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove management record for ssw1-a8-codfw - volans@cumin2002"
[18:50:04] <logmsgbot>	 !log volans@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[18:50:04] <logmsgbot>	 !log volans@cumin2002 END (FAIL) - Cookbook sre.network.provision (exit_code=99) for device ssw1-a8-codfw.mgmt.codfw.wmnet
[18:50:06] <logmsgbot>	 !log ryankemper@puppetmaster1001 conftool action : set/weight=0:pooled=inactive; selector: name=wdqs2021.*
[18:50:14] <logmsgbot>	 !log ryankemper@puppetmaster1001 conftool action : set/weight=0:pooled=inactive; selector: name=wdqs2022.*
[18:50:41] <RhinosF1>	 ryankemper: see https://phabricator.wikimedia.org/T335042 for why
[18:50:42] <wikibugs>	 (03PS1) 10Bking: query_service: Permit python2 on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/920365 (https://phabricator.wikimedia.org/T331300)
[18:50:59] <RhinosF1>	 Likely
[18:51:19] <wikibugs>	 (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/920365 (https://phabricator.wikimedia.org/T331300) (owner: 10Bking)
[18:51:41] <ryankemper>	 RhinosF1: ha, doh :) I was checking the current pybal states before re-pooling stuff for that maintenance and forgot that those were the hosts for that
[18:51:44] * ryankemper was the one that depooled them xD
[18:51:51] <wikibugs>	 (03CR) 10Herron: [C: 03+2] mwlog: rotate api.log hourly [puppet] - 10https://gerrit.wikimedia.org/r/919063 (https://phabricator.wikimedia.org/T277445) (owner: 10Herron)
[18:51:55] <RhinosF1>	 ryankemper: heh
[18:52:03] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.decommission for hosts dns2003.wikimedia.org
[18:53:58] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.ban Unbanning all hosts in search_codfw
[18:54:14] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Unbanning all hosts in search_codfw
[18:57:00] <wikibugs>	 (03PS1) 10Jdrewniak: Ensure mw-watchlink is used for the sticky header watchlink [skins/Vector] (wmf/1.41.0-wmf.8) - 10https://gerrit.wikimedia.org/r/920239 (https://phabricator.wikimedia.org/T336640)
[18:57:05] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.dns.netbox
[18:59:37] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: dns2003.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin2002"
[19:00:30] <wikibugs>	 (03PS1) 10Jdrewniak: Ensure mw-watchlink is used for the sticky header watchlink [skins/Vector] (wmf/1.41.0-wmf.9) - 10https://gerrit.wikimedia.org/r/920240 (https://phabricator.wikimedia.org/T336640)
[19:00:47] <wikibugs>	 10SRE, 10All-and-every-Wikisource, 10Search-Console-access-request: Search Console access request for Wikisource (Volunteer) - https://phabricator.wikimedia.org/T336255 (10SCherukuwada) Please assign this to me once C-level approval and NDA have been taken care of.
[19:00:47] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: dns2003.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin2002"
[19:00:47] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[19:00:48] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts dns2003.wikimedia.org
[19:00:57] <wikibugs>	 10SRE, 10ops-codfw, 10Traffic, 10Patch-For-Review: Q4:rack/decom codfw unified decommission task - https://phabricator.wikimedia.org/T335777 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by sukhe@cumin2002 for hosts: `dns2003.wikimedia.org` - dns2003.wikimedia.org (**WARN**)   - Downtime...
[19:01:30] <wikibugs>	 (03PS1) 10Jdrewniak: Consolidate watchstar icon updating logic under watchstar.js [skins/Vector] (wmf/1.41.0-wmf.8) - 10https://gerrit.wikimedia.org/r/920241 (https://phabricator.wikimedia.org/T336640)
[19:01:40] <wikibugs>	 10SRE, 10ops-codfw, 10Traffic, 10Patch-For-Review: Q4:rack/decom codfw unified decommission task - https://phabricator.wikimedia.org/T335777 (10ssingh)
[19:02:48] <wikibugs>	 (03PS1) 10Jdrewniak: Consolidate watchstar icon updating logic under watchstar.js [skins/Vector] (wmf/1.41.0-wmf.9) - 10https://gerrit.wikimedia.org/r/920242 (https://phabricator.wikimedia.org/T336640)
[19:03:19] <wikibugs>	 (03PS1) 10Volans: sre.network.provision: allow to retry polling [cookbooks] - 10https://gerrit.wikimedia.org/r/920366 (https://phabricator.wikimedia.org/T336485)
[19:03:59] <wikibugs>	 (03PS2) 10Bking: query_service: Permit python2 on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/920365 (https://phabricator.wikimedia.org/T331300)
[19:04:19] <sukhe>	 !log dummy run of authdns-update to confirm new hosts
[19:04:19] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+1] "LGTM." [cookbooks] - 10https://gerrit.wikimedia.org/r/920366 (https://phabricator.wikimedia.org/T336485) (owner: 10Volans)
[19:04:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:05:27] <wikibugs>	 (03PS3) 10Bking: query_service: Permit python2 on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/920365 (https://phabricator.wikimedia.org/T331300)
[19:06:26] <wikibugs>	 (03CR) 10Volans: [C: 03+2] sre.network.provision: allow to retry polling [cookbooks] - 10https://gerrit.wikimedia.org/r/920366 (https://phabricator.wikimedia.org/T336485) (owner: 10Volans)
[19:06:28] <wikibugs>	 (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/920365 (https://phabricator.wikimedia.org/T331300) (owner: 10Bking)
[19:06:46] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+2] Prepare for v0.1.3 release [software/wmfdb] - 10https://gerrit.wikimedia.org/r/920214 (https://phabricator.wikimedia.org/T334455) (owner: 10Ladsgroup)
[19:07:01] <wikibugs>	 (03CR) 10Ryan Kemper: [V: 03+1] "PCC SUCCESS (NOOP 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41208/console" [puppet] - 10https://gerrit.wikimedia.org/r/920365 (https://phabricator.wikimedia.org/T331300) (owner: 10Bking)
[19:08:33] <wikibugs>	 (03Merged) 10jenkins-bot: sre.network.provision: allow to retry polling [cookbooks] - 10https://gerrit.wikimedia.org/r/920366 (https://phabricator.wikimedia.org/T336485) (owner: 10Volans)
[19:08:35] <wikibugs>	 (03Merged) 10jenkins-bot: Prepare for v0.1.3 release [software/wmfdb] - 10https://gerrit.wikimedia.org/r/920214 (https://phabricator.wikimedia.org/T334455) (owner: 10Ladsgroup)
[19:08:38] <wikibugs>	 (03PS1) 10Cathal Mooney: Updating ssh pubkey to ed25519 [homer/public] - 10https://gerrit.wikimedia.org/r/920367 (https://phabricator.wikimedia.org/T336769)
[19:08:56] <wikibugs>	 (03PS1) 10Herron: logrotate: update description in override [puppet] - 10https://gerrit.wikimedia.org/r/920368
[19:10:29] <logmsgbot>	 !log volans@cumin2002 START - Cookbook sre.network.provision for device ssw1-a8-codfw.mgmt.codfw.wmnet
[19:10:31] <logmsgbot>	 !log volans@cumin2002 START - Cookbook sre.dns.netbox
[19:12:25] <logmsgbot>	 !log volans@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for ssw1-a8-codfw - volans@cumin2002"
[19:13:30] <logmsgbot>	 !log volans@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for ssw1-a8-codfw - volans@cumin2002"
[19:13:30] <logmsgbot>	 !log volans@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[19:14:47] <wikibugs>	 (03CR) 10Bking: "@jbond Wanted to solicit your advice on this one. In the original patch set, we attempted to use hieradata/common/profile/query_service.ya" [puppet] - 10https://gerrit.wikimedia.org/r/920365 (https://phabricator.wikimedia.org/T331300) (owner: 10Bking)
[19:23:12] <icinga-wm>	 RECOVERY - Check systemd state on wdqs1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:26:14] <jinxer-wm>	 (SystemdUnitFailed) resolved: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-categories.service Failed on wdqs1010:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:42:29] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+2] Updating ssh pubkey to ed25519 [homer/public] - 10https://gerrit.wikimedia.org/r/920367 (https://phabricator.wikimedia.org/T336769) (owner: 10Cathal Mooney)
[19:43:03] <wikibugs>	 (03Merged) 10jenkins-bot: Updating ssh pubkey to ed25519 [homer/public] - 10https://gerrit.wikimedia.org/r/920367 (https://phabricator.wikimedia.org/T336769) (owner: 10Cathal Mooney)
[19:55:00] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] gerrit: add gerrit1003 SSH host key known_hosts [puppet] - 10https://gerrit.wikimedia.org/r/919405 (https://phabricator.wikimedia.org/T326368) (owner: 10Dzahn)
[19:56:18] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "yep, agreed it would be nice if this was automatic but also not right now" [puppet] - 10https://gerrit.wikimedia.org/r/919405 (https://phabricator.wikimedia.org/T326368) (owner: 10Dzahn)
[19:57:10] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "key was added on 1003 and 2002 - though this will only matter if we start replicating TO this machine - if we do that in the future" [puppet] - 10https://gerrit.wikimedia.org/r/919405 (https://phabricator.wikimedia.org/T326368) (owner: 10Dzahn)
[20:00:04] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: My dear minions, it's time we take the moon! Just kidding. Time for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230516T2000).
[20:00:05] <jouncebot>	 MatmaRex and jan_drewniak: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:25] <MatmaRex>	 hi
[20:00:32] <jan_drewniak>	 o/
[20:00:49] <wikibugs>	 (03CR) 10Jameel Kaisar: "Note: For Reference Only, Not to be Merged" [puppet] - 10https://gerrit.wikimedia.org/r/916878 (https://phabricator.wikimedia.org/T335637) (owner: 10Jameel Kaisar)
[20:01:12] <MatmaRex>	 feel free to start with jan's stuff, looks more urgent
[20:03:15] <jan_drewniak>	 Also I can self deploy
[20:03:49] <MatmaRex>	 looks like no one else is doing it, so… ;)
[20:03:59] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "here is the part that matters, nothing is changed on prod host: https://puppet-compiler.wmflabs.org/output/919359/41172/gerrit1003.wikimed" [puppet] - 10https://gerrit.wikimedia.org/r/919359 (https://phabricator.wikimedia.org/T326368) (owner: 10Dzahn)
[20:04:24] <MatmaRex>	 i'd appreciate if you could sync and run my maintenance script afterwards. it's a dry run, just testing that it works in production before the real deployment
[20:04:55] <jan_drewniak>	 MatmaRex: yeah no problem
[20:05:32] <TheresNoTime>	 Got a deployer?
[20:05:44] <TheresNoTime>	 (yes, seems so)
[20:06:08] <mutante>	 lemme deploy a change to stop gerrit service.. on the old host :)
[20:07:46] <jan_drewniak>	 mutante: Ok, let me know when I can proceed with the backport window
[20:08:41] <mutante>	 jan_drewniak: thank you, a minute.. on it
[20:08:48] <mutante>	 confirmed noop on gerrit2002.. now gerrit1003
[20:10:07] <mutante>	 no problems on prod server
[20:10:13] <mutante>	 re-enabling puppet on old server
[20:11:01] <mutante>	 have to make sure it doesn't start gerrit service then all is done
[20:12:06] <mutante>	 confirmed:    Loaded: masked (Reason: Unit gerrit.service is masked.)
[20:12:17] <mutante>	 it is now masked which is what this change was supposed to do 
[20:12:26] <mutante>	 means it can't be started by accident and replicate or anything.
[20:12:28] <mutante>	 I am done
[20:12:34] <mutante>	 jan_drewniak: go ahead please. thank you for patience
[20:12:49] <jan_drewniak>	 No problem, that was quick!
[20:13:23] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jdrewniak@deploy1002 using scap backport" [skins/Vector] (wmf/1.41.0-wmf.9) - 10https://gerrit.wikimedia.org/r/920240 (https://phabricator.wikimedia.org/T336640) (owner: 10Jdrewniak)
[20:16:35] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "confirmed this all works as intended. on gerrit1001 the gerrit service is now masked and on gerrit1003 and gerrit2002 there was no change " [puppet] - 10https://gerrit.wikimedia.org/r/919359 (https://phabricator.wikimedia.org/T326368) (owner: 10Dzahn)
[20:16:41] <wikibugs>	 10SRE, 10ops-codfw, 10Cloud-VPS, 10Infrastructure-Foundations, and 2 others: cloudservices[2004/2005]-dev & cloudweb2002-dev: connect them to cloudsw so they can have cloud-private vlan - https://phabricator.wikimedia.org/T336587 (10taavi)
[20:16:56] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "@hashar masked, not just stopped, as you asked for:)" [puppet] - 10https://gerrit.wikimedia.org/r/919359 (https://phabricator.wikimedia.org/T326368) (owner: 10Dzahn)
[20:17:50] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] gerrit: remove gerrit1001 as a source host for migrations [puppet] - 10https://gerrit.wikimedia.org/r/919400 (https://phabricator.wikimedia.org/T336427) (owner: 10Dzahn)
[20:23:42] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "what this did:" [puppet] - 10https://gerrit.wikimedia.org/r/919400 (https://phabricator.wikimedia.org/T336427) (owner: 10Dzahn)
[20:24:23] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] gerrit: disable monitoring for gerrit1001 [puppet] - 10https://gerrit.wikimedia.org/r/919244 (https://phabricator.wikimedia.org/T326368) (owner: 10Dzahn)
[20:24:52] <wikibugs>	 (03CR) 10Jdrewniak: [C: 03+2] Consolidate watchstar icon updating logic under watchstar.js [skins/Vector] (wmf/1.41.0-wmf.9) - 10https://gerrit.wikimedia.org/r/920242 (https://phabricator.wikimedia.org/T336640) (owner: 10Jdrewniak)
[20:28:43] <wikibugs>	 (03Merged) 10jenkins-bot: Ensure mw-watchlink is used for the sticky header watchlink [skins/Vector] (wmf/1.41.0-wmf.9) - 10https://gerrit.wikimedia.org/r/920240 (https://phabricator.wikimedia.org/T336640) (owner: 10Jdrewniak)
[20:28:44] <jan_drewniak>	 MatmaRex: while we're waiting for those to merge, I'm looking at your patch but I don't actually know how to deploy that... (like, where should that script be run?)
[20:29:09] <logmsgbot>	 !log jdrewniak@deploy1002 Started scap: Backport for [[gerrit:920240|Ensure mw-watchlink is used for the sticky header watchlink (T336640 T336641)]]
[20:29:14] <stashbot>	 T336640: Vector sticky header watch/unwatch button disappears when clicked - https://phabricator.wikimedia.org/T336640
[20:29:15] <stashbot>	 T336641: Vector sticky header watch/unwatch icon is always in the "not watched" state - https://phabricator.wikimedia.org/T336641
[20:29:43] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "I could see on alert1001 how icinga checks were removed from config but I still see in Icinga web UI.. is it on 2001? running puppet there" [puppet] - 10https://gerrit.wikimedia.org/r/919244 (https://phabricator.wikimedia.org/T326368) (owner: 10Dzahn)
[20:30:12] <MatmaRex>	 jan_drewniak: hm, i'm not sure either but it's documented somewhere, let me see if i can find it
[20:30:15] <brett>	 !log Rolling out maglev LVS scheduler in drmrs (for real this time) - T263797
[20:30:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:30:19] <stashbot>	 T263797: Switch Source Hashing ('sh') scheduling on LVS hosts to Maglev hashing ('mh') - https://phabricator.wikimedia.org/T263797
[20:30:41] <logmsgbot>	 !log jdrewniak@deploy1002 jdrewniak: Backport for [[gerrit:920240|Ensure mw-watchlink is used for the sticky header watchlink (T336640 T336641)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet
[20:31:15] <RhinosF1>	 jan_drewniak: there's a deploy commands tool
[20:31:50] <MatmaRex>	 https://wikitech.wikimedia.org/wiki/Maintenance_server
[20:31:50] <RhinosF1>	 jan_drewniak: https://deploy-commands.toolforge.org/bacc
[20:32:03] <RhinosF1>	 jouncebot: now
[20:32:03] <jouncebot>	 For the next 0 hour(s) and 27 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230516T2000)
[20:32:37] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] remove wmcs-backup-instances script, no longer used [puppet] - 10https://gerrit.wikimedia.org/r/919896 (owner: 10Andrew Bogott)
[20:33:04] <MatmaRex>	 i don't think the deploy commands are relevant for running a maintenance script, just for other deployments
[20:33:51] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "this just removed a few of them, like HTTPS on gerrit1001, but gerrit1001 still has the base checks that are not specific to service and a" [puppet] - 10https://gerrit.wikimedia.org/r/919244 (https://phabricator.wikimedia.org/T326368) (owner: 10Dzahn)
[20:33:52] <RhinosF1>	 MatmaRex: the script is just ran on mwmaint ye
[20:35:14] <MatmaRex>	 jan_drewniak: summarizing from that page – i think i want you to ssh into mwmaint1002, then run `mwscript MediaWiki.Extension.DiscussionTools.Maintenance.NewTopicOptOutActiveUsers --wiki=fiwiki --dry-run`
[20:35:36] <MatmaRex>	 i'm not sure if that's exactly the right command for run.php stuff, but we can try and see, nothing terrible will happen if it fails
[20:36:03] <jan_drewniak>	 MatmaRex: ok thanks, I was just reading that :) 
[20:36:18] <icinga-wm>	 PROBLEM - pybal on lvs6002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal
[20:36:53] <logmsgbot>	 !log jdrewniak@deploy1002 Finished scap: Backport for [[gerrit:920240|Ensure mw-watchlink is used for the sticky header watchlink (T336640 T336641)]] (duration: 07m 44s)
[20:37:00] <stashbot>	 T336640: Vector sticky header watch/unwatch button disappears when clicked - https://phabricator.wikimedia.org/T336640
[20:37:01] <stashbot>	 T336641: Vector sticky header watch/unwatch icon is always in the "not watched" state - https://phabricator.wikimedia.org/T336641
[20:37:08] <icinga-wm>	 PROBLEM - PyBal connections to etcd on lvs6002 is CRITICAL: CRITICAL: 0 connections established with conf1009.eqiad.wmnet:4001 (min=4) https://wikitech.wikimedia.org/wiki/PyBal
[20:37:18] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jdrewniak@deploy1002 using scap backport" [skins/Vector] (wmf/1.41.0-wmf.9) - 10https://gerrit.wikimedia.org/r/920242 (https://phabricator.wikimedia.org/T336640) (owner: 10Jdrewniak)
[20:39:24] <wikibugs>	 (03Merged) 10jenkins-bot: Consolidate watchstar icon updating logic under watchstar.js [skins/Vector] (wmf/1.41.0-wmf.9) - 10https://gerrit.wikimedia.org/r/920242 (https://phabricator.wikimedia.org/T336640) (owner: 10Jdrewniak)
[20:39:46] <logmsgbot>	 !log jdrewniak@deploy1002 Started scap: Backport for [[gerrit:920242|Consolidate watchstar icon updating logic under watchstar.js (T336640 T336641)]]
[20:41:25] <logmsgbot>	 !log jdrewniak@deploy1002 jdrewniak: Backport for [[gerrit:920242|Consolidate watchstar icon updating logic under watchstar.js (T336640 T336641)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet
[20:41:47] <wikibugs>	 (03PS1) 10Volans: install_server: fix ztp-juniper script [puppet] - 10https://gerrit.wikimedia.org/r/920374 (https://phabricator.wikimedia.org/T336485)
[20:45:27] <icinga-wm>	 ACKNOWLEDGEMENT - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T336814 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[20:45:31] <wikibugs>	 10SRE, 10ops-eqiad: Degraded RAID on analytics1068 - https://phabricator.wikimedia.org/T336814 (10ops-monitoring-bot)
[20:46:46] <wikibugs>	 (03CR) 10Jdrewniak: [C: 03+2] Add maint script to opt out active users from the new topic tool [extensions/DiscussionTools] (wmf/1.41.0-wmf.8) - 10https://gerrit.wikimedia.org/r/920237 (https://phabricator.wikimedia.org/T317375) (owner: 10Bartosz Dziewoński)
[20:47:44] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs6002 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal
[20:49:06] <logmsgbot>	 !log jdrewniak@deploy1002 Finished scap: Backport for [[gerrit:920242|Consolidate watchstar icon updating logic under watchstar.js (T336640 T336641)]] (duration: 09m 19s)
[20:49:12] <stashbot>	 T336640: Vector sticky header watch/unwatch button disappears when clicked - https://phabricator.wikimedia.org/T336640
[20:49:13] <stashbot>	 T336641: Vector sticky header watch/unwatch icon is always in the "not watched" state - https://phabricator.wikimedia.org/T336641
[20:49:41] <logmsgbot>	 !log volans@cumin2002 END (PASS) - Cookbook sre.network.provision (exit_code=0) for device ssw1-a8-codfw.mgmt.codfw.wmnet
[20:49:50] <jan_drewniak>	 MatmaRex: ok I'm deploying yours to 1.8 first, then I'll run the script, then I'll do 1.9, does that sound good?
[20:50:33] <MatmaRex>	 yeah, sounds correct
[20:50:54] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jdrewniak@deploy1002 using scap backport" [extensions/DiscussionTools] (wmf/1.41.0-wmf.8) - 10https://gerrit.wikimedia.org/r/920237 (https://phabricator.wikimedia.org/T317375) (owner: 10Bartosz Dziewoński)
[20:51:12] <icinga-wm>	 PROBLEM - BGP status on asw1-b13-drmrs.mgmt is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[20:51:50] <wikibugs>	 (03Merged) 10jenkins-bot: Add maint script to opt out active users from the new topic tool [extensions/DiscussionTools] (wmf/1.41.0-wmf.8) - 10https://gerrit.wikimedia.org/r/920237 (https://phabricator.wikimedia.org/T317375) (owner: 10Bartosz Dziewoński)
[20:52:19] <logmsgbot>	 !log jdrewniak@deploy1002 Started scap: Backport for [[gerrit:920237|Add maint script to opt out active users from the new topic tool (T317375)]]
[20:52:23] <stashbot>	 T317375: [Config change] Deploy New Topic Tool as opt-out preference at fi.wiki (desktop) - https://phabricator.wikimedia.org/T317375
[20:53:27] <jan_drewniak>	 MatmaRex: assuming there's nothing to check on mwdebug?
[20:53:41] <MatmaRex>	 nope
[20:53:49] <logmsgbot>	 !log jdrewniak@deploy1002 jdrewniak and matmarex: Backport for [[gerrit:920237|Add maint script to opt out active users from the new topic tool (T317375)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet
[20:59:37] <logmsgbot>	 !log jdrewniak@deploy1002 Finished scap: Backport for [[gerrit:920237|Add maint script to opt out active users from the new topic tool (T317375)]] (duration: 07m 18s)
[20:59:42] <stashbot>	 T317375: [Config change] Deploy New Topic Tool as opt-out preference at fi.wiki (desktop) - https://phabricator.wikimedia.org/T317375
[21:00:50] <jan_drewniak>	 alright this is what I'm gonna run `php maintenance/run.php MediaWiki.Extension.DiscussionTools.Maintenance.NewTopicOptOutActiveUsers --wiki=fiwiki --dry-run` 
[21:02:09] <MatmaRex>	 jan_drewniak: not with mwscript?
[21:02:40] <jan_drewniak>	 yeah the above just failed, lol, I'll do with mwscript
[21:02:51] <wikibugs>	 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T336720 (10wiki_willy) a:03Jhancock.wm
[21:03:19] <wikibugs>	 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T336538 (10wiki_willy) a:03Jhancock.wm
[21:04:28] <wikibugs>	 10SRE, 10ops-ulsfo, 10DC-Ops, 10decommission-hardware, 10SRE Observability (FY2022/2023-Q4): Decommission prometheus4001 - https://phabricator.wikimedia.org/T335585 (10wiki_willy) a:03RobH
[21:04:58] <wikibugs>	 10SRE, 10ops-eqsin, 10DC-Ops, 10decommission-hardware, 10SRE Observability (FY2022/2023-Q4): Decommission prometheus5001 - https://phabricator.wikimedia.org/T335587 (10wiki_willy) a:03RobH
[21:05:34] <wikibugs>	 10SRE, 10ops-eqsin, 10DC-Ops, 10decommission-hardware, 10SRE Observability (FY2022/2023-Q4): Decommission prometheus5001 - https://phabricator.wikimedia.org/T335587 (10wiki_willy) @RobH - this might be something we could add to the recycle pickup
[21:06:01] <wikibugs>	 10ops-drmrs, 10DC-Ops, 10decommission-hardware, 10SRE Observability (FY2022/2023-Q4): Decommission prometheus6001 - https://phabricator.wikimedia.org/T335588 (10wiki_willy) a:03RobH
[21:06:22] <icinga-wm>	 RECOVERY - BGP status on asw1-b13-drmrs.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[21:06:30] <icinga-wm>	 RECOVERY - pybal on lvs6002 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal
[21:06:57] <jan_drewniak>	 MatmaRex: running the script with mwscript doesn't work, gives me this error `It does not set $maintClass and does not return a class name.` 
[21:07:09] <jan_drewniak>	 so I think I have to run it with run.php
[21:07:09] <jinxer-wm>	 (MXQueueHigh) firing: MX host mx1001:9100 has many queued messages: 4006 #page - https://wikitech.wikimedia.org/wiki/Exim - https://grafana.wikimedia.org/d/000000451/mail - https://alerts.wikimedia.org/?q=alertname%3DMXQueueHigh
[21:07:10] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs6002 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[21:07:39] <MatmaRex>	 hmm
[21:07:41] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Setup zero touch provisioning (ZTP) for network devices - https://phabricator.wikimedia.org/T336485 (10Volans) Result of the testing with Cathal. I first want to thank @cmooney for all the help with JunOS-magics, that was pre...
[21:08:11] <MatmaRex>	 jan_drewniak: what's your exact command? mwscript should already use run.php internally
[21:08:30] <mutante>	 got p.aged. acked. 
[21:09:17] <RhinosF1>	 mutante: fyi deployment going on too
[21:09:30] <RhinosF1>	 jan_drewniak: you don't need run.php if using mwscript
[21:09:40] <icinga-wm>	 RECOVERY - PyBal connections to etcd on lvs6002 is OK: OK: 4 connections established with conf1009.eqiad.wmnet:4001 (min=4) https://wikitech.wikimedia.org/wiki/PyBal
[21:10:12] <mutante>	 it's not the LVS thing
[21:10:14] <jan_drewniak>	 ok so where should I run it from? `mwscript MediaWiki.Extension.DiscussionTools.Maintenance.NewTopicOptOutActiveUsers --wiki=fiwiki --dry-run` not found
[21:10:34] <jan_drewniak>	 the extension root? 
[21:11:05] <RhinosF1>	 jan_drewniak: try the full path to the script
[21:11:14] <RhinosF1>	 I believe both that and class is supported
[21:11:23] <MatmaRex>	 jan_drewniak: anywhere
[21:11:30] <MatmaRex>	 i don't think it matters what directory you're in
[21:11:43] <RhinosF1>	 mutante: are you happy with a confused mediawiki deployment going on?
[21:12:28] <mutante>	 RhinosF1: no reason to believe it's related to deployment, but I can't pay attention to deployment
[21:12:41] <RhinosF1>	 Good
[21:12:41] <MatmaRex>	 jan_drewniak: anyway, if we can't get it to work, i can try again tomorrow. it's not urgent and we're past time
[21:13:10] <RhinosF1>	 MatmaRex: might be worth switching to the php file format if you're not sure on class but
[21:13:26] <jan_drewniak>	 I'm running it with the full path ` mwscript /srv/mediawiki/php-1\41\0-wmf\8/maintenance/MediaWiki\Extension\DiscussionTools\Maintenance\NewTopicOptOutActiveUsers.php --wiki=fiwiki --dry-run` but still "not found"
[21:13:49] <Reedy>	 That wouldn't be right anyway...
[21:13:51] <RhinosF1>	 jan_drewniak: why \ instead of .
[21:13:55] <TheresNoTime>	 `mwscript extensions/DiscussionTools/maintenance/NewTopicOptOutActiveUsers.php --wiki fiwiki --dry-run` ?
[21:13:58] <MatmaRex>	 i don't think that would work. mwscript should figure out the path itself
[21:13:59] <Reedy>	 ^
[21:14:06] <RhinosF1>	 But yes what TheresNoTime said
[21:15:10] <jan_drewniak>	 https://www.irccloud.com/pastebin/0ARobnYF/
[21:15:31] <jan_drewniak>	 MatmaRex: is that an issue with the script?
[21:16:11] <MatmaRex>	 i don't know, that's weird
[21:16:18] <RhinosF1>	 That's an issue with the script yes
[21:16:20] <MatmaRex>	 i am sure that the script *can* be executed using MaintenanceRunner
[21:16:24] <icinga-wm>	 PROBLEM - Check systemd state on doc1002 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-doc-doc2002.codfw.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:16:33] <MatmaRex>	 because i've been running it with run.php locally
[21:16:59] <MatmaRex>	 well, actually
[21:17:13] <MatmaRex>	 the problem is that it's trying to execute it from the file name
[21:17:27] <MatmaRex>	 it should be executed using the weird namespace path with dots
[21:17:34] <RhinosF1>	 MatmaRex: is this script only run manually? Does it fix anything urgent or if we are unsure, could the running be halted until people confident are around?
[21:17:38] <MatmaRex>	 which is supposed to be the new hotness in executing maintenance scripts
[21:17:45] <RhinosF1>	 MatmaRex: run.php should support both types
[21:17:54] <RhinosF1>	 And that didn't work either for jan_drewniak
[21:18:06] <MatmaRex>	 RhinosF1: i have already said that we can drop it. but it looks like folks want to figure it out
[21:18:18] <RhinosF1>	 Reedy, TheresNoTime: ^
[21:18:24] <MatmaRex>	 i don't know what commands jan used, although i'd be curious to see
[21:18:43] <MatmaRex>	 i think the correct command is: mwscript MediaWiki.Extension.DiscussionTools.Maintenance.NewTopicOptOutActiveUsers --wiki=fiwiki --dry-run
[21:19:03] <MatmaRex>	 anyway. i can try again tomorrow if you want to close the window. i'm completely fine with that
[21:19:18] <RhinosF1>	 I'm not important enough to make that call
[21:19:31] <MatmaRex>	 hence i'm asking jan_drewniak
[21:19:31] <jan_drewniak>	 https://www.irccloud.com/pastebin/CGImOadJ/NewTopicOptOutActiveUsers%20test
[21:19:32] <RhinosF1>	 But unsure people randomly guessing commands doesn't feel safe
[21:19:57] <RhinosF1>	 jan_drewniak: what was the full error when using the format with .'s
[21:19:58] <jan_drewniak>	 MatmaRex: yeah, I tried both
[21:20:04] <RhinosF1>	 The same?
[21:20:46] <MatmaRex>	 ok, that's interesting. this part: "Script '/srv/mediawiki/php-1.41.0-wmf.8/maintenance/MediaWiki.Extension.DiscussionTools.Maintenance.NewTopicOptOutActiveUsers' not found (tried path '/srv/mediawiki/php-1.41.0-wmf.8/maintenance/MediaWiki.Extension.DiscussionTools.Maintenance.NewTopicOptOutActiveUsers.php' and class '/srv/mediawiki/php-1\41\0-wmf\8/maintenance/MediaWiki\Extension\DiscussionTools\Maintenance\NewTopicOptOutActiveUsers')
[21:21:06] <MatmaRex>	 i don't know where this error comes from, but it should not be building paths like that
[21:21:14] <RhinosF1>	 That's run.php
[21:21:15] <MatmaRex>	 anyway. i can look into it later
[21:21:17] <jan_drewniak>	 using the classname gives me a not found error, maybe I'm not executing it from the right path? using the file path says it's not executable with maintenance running.
[21:21:37] <wikibugs>	 (03PS1) 10Ottomata: Declare mediawiki.page_change.v1 stream and produce from Eventbus [mediawiki-config] - 10https://gerrit.wikimedia.org/r/920378 (https://phabricator.wikimedia.org/T336817)
[21:21:37] <RhinosF1>	 jan_drewniak: I think it's best to call it a day and let someone more confident take over
[21:21:46] <RhinosF1>	 Who don't seem to be around
[21:22:07] <jan_drewniak>	 I think so, in any case it's deployed to wmf.8, is it ok if it stays there?
[21:22:17] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Declare mediawiki.page_change.v1 stream and produce from Eventbus [mediawiki-config] - 10https://gerrit.wikimedia.org/r/920378 (https://phabricator.wikimedia.org/T336817) (owner: 10Ottomata)
[21:22:29] <MatmaRex>	 yes. the script does nothing by itself, it can stay deployed
[21:23:07] <Reedy>	 It's probably some autoloader screwy-ness
[21:23:22] <jan_drewniak>	 MatmaRex: ok, sorry I don't know what I'm doing 😅 better luck tomorrow
[21:24:01] <MatmaRex>	 that'll teach me not to try to write things in the modern way
[21:24:36] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs6001 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal
[21:24:40] <icinga-wm>	 PROBLEM - pybal on lvs6001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal
[21:24:52] <icinga-wm>	 PROBLEM - BGP status on asw1-b12-drmrs.mgmt is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[21:25:07] <TheresNoTime>	 I'd still have expected it to run when called directly via `mwscript` though..
[21:25:49] <Reedy>	 https://wikitech.wikimedia.org/wiki/Maintenance_server#Run_a_maintenance_script_on_a_wiki
[21:25:50] <Reedy>	 !bug 1
[21:25:50] <wm-bot>	 https://bugzilla.wikimedia.org/show_bug.cgi?id=1
[21:26:00] <mutante>	 RhinosF1: hey, so.. can you tell me more about the deployment and the job 
[21:26:02] <RhinosF1>	 jan_drewniak: never apologise for being unsure, best thing to do is say!
[21:26:08] <mutante>	 RhinosF1: maybe it IS related after all
[21:26:13] <RhinosF1>	 mutante: deployment aborted anyway
[21:26:20] <RhinosF1>	 It's a maint script
[21:26:22] <RhinosF1>	 No one can run it
[21:26:23] <mutante>	 is it possible this sends email
[21:26:23] <wikibugs>	 (03PS1) 10Ottomata: page_content_change - Consume from mediawiki.page_change.v1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/920379 (https://phabricator.wikimedia.org/T336817)
[21:26:25] <Reedy>	 TheresNoTime: The error makes sense there though, as the stuff that the old way would "need" is indeed missing
[21:26:29] <RhinosF1>	 mutante: it's not ran
[21:26:29] <Reedy>	 mutante: No
[21:26:30] <RhinosF1>	 So no
[21:26:30] <mutante>	 from wiki@wikimedia.org 
[21:26:41] <mutante>	 ok
[21:27:15] <TheresNoTime>	 Reedy: mhm, sorry yes I meant "things written the new way should be backwards compatible unless we've agreed to phase that out@
[21:27:19] <RhinosF1>	 mutante: happy to let this channel focus on the page though
[21:27:21] <TheresNoTime>	 s/@/"
[21:27:22] <mutante>	 well, nevermind then :)
[21:27:31] <Reedy>	 TheresNoTime: Blame MatmaRex for not adding the boilerplate ;D
[21:27:34] <mutante>	 RhinosF1: no, it's ok, we are using other
[21:27:40] <TheresNoTime>	 tsk
[21:27:54] <Reedy>	 I guess the new method will work fine if it's there... So that is really the forwards compatible way
[21:27:57] <MatmaRex>	 it's not supposed to be added, is what i heard
[21:27:57] <RhinosF1>	 mutante: cool
[21:27:58] <MatmaRex>	 anyway
[21:27:59] <Reedy>	 Until we deprecate/remove the old way, and then remove it
[21:27:59] <MatmaRex>	 i see the bug
[21:28:00] <MatmaRex>	 https://gerrit.wikimedia.org/g/operations/mediawiki-config/+/d386f3152201cf26e7e2387094c7321b66d8ff3f/multiversion/MWScript.php#68
[21:28:05] <MatmaRex>	 this crap is messing up the class name
[21:28:29] <MatmaRex>	 you can see it in jan's error message in https://www.irccloud.com/pastebin/CGImOadJ/NewTopicOptOutActiveUsers%20test
[21:28:34] <icinga-wm>	 PROBLEM - PyBal connections to etcd on lvs6001 is CRITICAL: CRITICAL: 0 connections established with conf1009.eqiad.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal
[21:29:37] <MatmaRex>	 there are like 5 ways to run scripts now, eh
[21:29:43] <TheresNoTime>	 *URGH*
[21:29:52] <Reedy>	 only 5?
[21:30:00] * bd808 makes a new way
[21:30:08] <MatmaRex>	 mwscript DiscussionTools:NewTopicOptOutActiveUsers --wiki=fiwiki --dry-run
[21:30:12] <MatmaRex>	 this will probably work ^
[21:30:15] <Reedy>	 syntax is hard
[21:30:18] * RhinosF1 bowing out for the night, I will dream up new ways
[21:30:30] <Reedy>	 https://wikitech.wikimedia.org/wiki/Maintenance_server#Run_a_maintenance_script_on_a_wiki needs updating
[21:30:48] * TheresNoTime only just updated it D:
[21:31:04] <RhinosF1>	 Reedy: yes it does because I need to write a new mwscript for Miraheze at some point that properly supports this madness
[21:31:06] <TheresNoTime>	 after the *last time* they changed how script ran
[21:31:49] <TheresNoTime>	 MatmaRex: (it didn't fwiw)
[21:31:55] <MatmaRex>	 heh
[21:32:06] <jan_drewniak>	 I can give it one more shot! I didn't try `mwscript DiscussionTools:NewTopicOptOutActiveUsers --wiki=fiwiki --dry-run` (with the colon)
[21:32:09] <jinxer-wm>	 (MXQueueHigh) resolved: MX host mx1001:9100 has many queued messages: 4004 #page - https://wikitech.wikimedia.org/wiki/Exim - https://grafana.wikimedia.org/d/000000451/mail - https://alerts.wikimedia.org/?q=alertname%3DMXQueueHigh
[21:32:21] <Reedy>	 sync the extension dir fully?
[21:32:26] <Reedy>	 I lose track of what was deployed
[21:33:27] <TheresNoTime>	 probably best to hold off entirely now.. there's Stuff(tm) going on, and that's not an ideal time to guess commands in production 
[21:33:53] <Reedy>	 unless it was trying to send emails... it's completely unrelated
[21:34:06] <Reedy>	 even if it was... it's not getting as far as executing the code for it anyway
[21:34:07] <jan_drewniak>	 alright, I'm as curious as anyone, but I'll leave it with MatmaRex then :) 
[21:35:05] <MatmaRex>	 i'll schedule it for another time
[21:39:28] <wikibugs>	 (03PS2) 10Ottomata: Declare mediawiki.page_change.v1 stream and produce from Eventbus [mediawiki-config] - 10https://gerrit.wikimedia.org/r/920378 (https://phabricator.wikimedia.org/T336817)
[21:40:09] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Declare mediawiki.page_change.v1 stream and produce from Eventbus [mediawiki-config] - 10https://gerrit.wikimedia.org/r/920378 (https://phabricator.wikimedia.org/T336817) (owner: 10Ottomata)
[21:43:27] <wikibugs>	 (03PS3) 10Ottomata: Declare mediawiki.page_change.v1 stream and produce from Eventbus [mediawiki-config] - 10https://gerrit.wikimedia.org/r/920378 (https://phabricator.wikimedia.org/T336817)
[21:44:52] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs6001 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[21:44:58] <icinga-wm>	 RECOVERY - pybal on lvs6001 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal
[21:45:06] <icinga-wm>	 RECOVERY - BGP status on asw1-b12-drmrs.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[21:45:14] <icinga-wm>	 RECOVERY - PyBal connections to etcd on lvs6001 is OK: OK: 12 connections established with conf1009.eqiad.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal
[21:47:18] <icinga-wm>	 PROBLEM - Check systemd state on wdqs2022 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service,wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-categories.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:49:58] <MatmaRex>	 i filed https://phabricator.wikimedia.org/T336819 "Maintenance script designed for run.php <class> syntax cannot be executed in Wikimedia production"
[21:50:23] <mutante>	 RhinosF1: the mail incident is resolved, fwiw
[21:51:14] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service Failed on wdqs2022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:01:22] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Setup zero touch provisioning (ZTP) for network devices - https://phabricator.wikimedia.org/T336485 (10cmooney) Awesome work getting it working @volans big thanks to you too :)  >>! In T336485#8857232, @Volans wrote: > HTTP i...
[22:01:48] <icinga-wm>	 PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS6939/IPv4: Idle - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[22:01:59] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/920374 (https://phabricator.wikimedia.org/T336485) (owner: 10Volans)
[22:04:06] <wikibugs>	 (03CR) 10MarcoAurelio: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/920244 (https://phabricator.wikimedia.org/T336675) (owner: 10MarcoAurelio)
[22:08:35] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Update network SSH keys to ssh-ed25519 - https://phabricator.wikimedia.org/T336769 (10cmooney)
[22:13:44] <icinga-wm>	 RECOVERY - Check systemd state on doc1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:13:56] <wikibugs>	 (03PS1) 10Ottomata: Create mediawiki-page-content-change-enrichment namespaces in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/920382 (https://phabricator.wikimedia.org/T330507)
[22:14:38] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Create mediawiki-page-content-change-enrichment namespaces in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/920382 (https://phabricator.wikimedia.org/T330507) (owner: 10Ottomata)
[22:15:18] <wikibugs>	 (03PS2) 10Ottomata: Create mediawiki-page-content-change-enrichment namespaces in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/920382 (https://phabricator.wikimedia.org/T330507)
[22:21:31] <wikibugs>	 (03PS4) 10MarcoAurelio: dblists: Close akwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/920244 (https://phabricator.wikimedia.org/T336675)
[22:24:56] <wikibugs>	 (03CR) 10Dzahn: [C: 04-1] "this should now happen after we reimaged gerrit2002 and as part of that also moved the data" [puppet] - 10https://gerrit.wikimedia.org/r/908617 (https://phabricator.wikimedia.org/T333143) (owner: 10Hashar)
[22:25:22] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] gerrit: remove gerrit1001 from .ssh/config [puppet] - 10https://gerrit.wikimedia.org/r/919403 (https://phabricator.wikimedia.org/T336427) (owner: 10Dzahn)
[22:25:34] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] gerrit: add gerrit1003 to hosts using KexAlgo ecdh-sha2-nistp521 for ssh [puppet] - 10https://gerrit.wikimedia.org/r/919402 (https://phabricator.wikimedia.org/T326368) (owner: 10Dzahn)
[22:25:49] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] gerrit: remove gerrit1001 from ssh_allowed hosts and acme_chief [puppet] - 10https://gerrit.wikimedia.org/r/919401 (https://phabricator.wikimedia.org/T336427) (owner: 10Dzahn)
[22:26:38] <wikibugs>	 (03CR) 10Dzahn: [C: 04-1] "As opposed to other changes that are ready to go we should probably wait here until the host is actually shut down. ?" [puppet] - 10https://gerrit.wikimedia.org/r/919408 (https://phabricator.wikimedia.org/T336427) (owner: 10Dzahn)
[22:28:26] <hauskater>	 jouncebot: nowandnext
[22:28:26] <jouncebot>	 No deployments scheduled for the next 7 hour(s) and 31 minute(s)
[22:28:26] <jouncebot>	 In 7 hour(s) and 31 minute(s): MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230517T0600)
[22:31:26] <wikibugs>	 (03PS1) 10Kimberly Sarabia: Enable zebra ab test in hewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/920386 (https://phabricator.wikimedia.org/T335309)
[22:41:01] <wikibugs>	 (03PS2) 10Kimberly Sarabia: Enable zebra ab test in hewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/920386 (https://phabricator.wikimedia.org/T335309)
[22:41:03] <wikibugs>	 (03PS3) 10Jdlrobson: Enable zebra ab test in hewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/920386 (https://phabricator.wikimedia.org/T335972) (owner: 10Kimberly Sarabia)
[22:49:53] <hauskater>	 I think https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/874925 can be abandoned now
[22:50:01] <jinxer-wm>	 (NodeTextfileStale) firing: (2) Stale textfile for lists1003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[22:52:40] <wikibugs>	 (03CR) 10Jdlrobson: [C: 03+1] Enable zebra ab test in hewiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/920386 (https://phabricator.wikimedia.org/T335972) (owner: 10Kimberly Sarabia)
[22:54:49] <wikibugs>	 (03PS2) 10Jdlrobson: Launch content separation Zebra AB Test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/918568 (https://phabricator.wikimedia.org/T335972) (owner: 10Kimberly Sarabia)
[22:55:29] <wikibugs>	 (03CR) 10Jdlrobson: [C: 03+1] "Looks ready to go" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/918568 (https://phabricator.wikimedia.org/T335972) (owner: 10Kimberly Sarabia)
[23:05:42] <wikibugs>	 (03PS3) 10MarcoAurelio: Update pnbwiktionary project namespace and sitename [mediawiki-config] - 10https://gerrit.wikimedia.org/r/859080 (https://phabricator.wikimedia.org/T323545) (owner: 10Middle river exports)
[23:05:50] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Update pnbwiktionary project namespace and sitename [mediawiki-config] - 10https://gerrit.wikimedia.org/r/859080 (https://phabricator.wikimedia.org/T323545) (owner: 10Middle river exports)
[23:19:07] <wikibugs>	 (03CR) 10MarcoAurelio: Update pnbwiktionary project namespace and sitename (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/859080 (https://phabricator.wikimedia.org/T323545) (owner: 10Middle river exports)
[23:25:21] <wikibugs>	 (03CR) 10MarcoAurelio: [C: 04-1] "Hello. Since this patch was uploaded the configuration files have changed a bit. It needs to be rebased and modified accordingly, or maybe" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/859080 (https://phabricator.wikimedia.org/T323545) (owner: 10Middle river exports)
[23:37:54] * Krinkle staging on mwdebug1002
[23:53:19] <wikibugs>	 (03CR) 10Tim Starling: [C: 03+1] Profiler: Implement "Excimer UI" option for WikimediaDebug [mediawiki-config] - 10https://gerrit.wikimedia.org/r/902529 (https://phabricator.wikimedia.org/T291015) (owner: 10Krinkle)
[23:57:53] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Update network SSH keys to ssh-ed25519 - https://phabricator.wikimedia.org/T336769 (10ssingh)