[00:35:00] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for testvm2005:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[00:39:16] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/919377
[00:39:20] <wikibugs>	 (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/919377 (owner: TrainBranchBot)
[00:56:34] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/919377 (owner: TrainBranchBot)
[00:59:06] <jinxer-wm>	 (CirrusSearchJobQueueBacklogTooBig) firing: CirrusSearch job topic eqiad.mediawiki.job.cirrusSearchLinksUpdate is heavily backlogged with 241.2k messages - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=eqiad%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueBacklogTooBig
[01:19:06] <jinxer-wm>	 (CirrusSearchJobQueueBacklogTooBig) resolved: CirrusSearch job topic eqiad.mediawiki.job.cirrusSearchLinksUpdate is heavily backlogged with 200.2k messages - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=eqiad%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueBacklogTooBig
[01:22:20] <wikibugs>	 (CR) BryanDavis: signup:blocklist Expand blocklist feature (1 comment) [software/bitu] - https://gerrit.wikimedia.org/r/919005 (owner: Slyngshede)
[01:39:04] <wikibugs>	 Puppet, Cloud-VPS, Infrastructure-Foundations, cloud-services-team: git-sync-upstream failing - https://phabricator.wikimedia.org/T336263 (Andrew) Open→Resolved
[01:56:25] <icinga-wm>	 PROBLEM - Check systemd state on stat1008 is CRITICAL: CRITICAL - degraded: The following units failed: jupyter-appledora-singleuser.service,jupyter-dsaez-singleuser-conda-analytics.service,session-c8682.scope,user@114.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:02:39] <icinga-wm>	 PROBLEM - Check systemd state on stat1008 is CRITICAL: CRITICAL - degraded: The following units failed: jupyter-appledora-singleuser.service,jupyter-dsaez-singleuser-conda-analytics.service,session-c8682.scope,session-c8683.scope,user-runtime-dir@114.service,user@114.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:04:57] <wikibugs>	 (PS2) KartikMistry: Enable the new Special:Contribute page entry point for desktop on selected wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/918922 (https://phabricator.wikimedia.org/T327868)
[02:06:32] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:07:17] <icinga-wm>	 PROBLEM - Check systemd state on stat1008 is CRITICAL: CRITICAL - degraded: The following units failed: jupyter-appledora-singleuser.service,jupyter-dsaez-singleuser-conda-analytics.service,session-c8682.scope,session-c8683.scope,user-runtime-dir@114.service,user@114.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:11:32] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:11:53] <icinga-wm>	 PROBLEM - Check systemd state on stat1008 is CRITICAL: CRITICAL - degraded: The following units failed: jupyter-appledora-singleuser.service,jupyter-dsaez-singleuser-conda-analytics.service,session-c8682.scope,session-c8683.scope,user-runtime-dir@114.service,user@114.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:13:01] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for bast2003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[02:16:31] <icinga-wm>	 PROBLEM - Check systemd state on stat1008 is CRITICAL: CRITICAL - degraded: The following units failed: jupyter-appledora-singleuser.service,jupyter-dsaez-singleuser-conda-analytics.service,session-c8682.scope,session-c8683.scope,user-runtime-dir@114.service,user@114.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:26:32] <jinxer-wm>	 (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:28:34] <jinxer-wm>	 (Access port speed <= 100Mbps) firing: Alert for device asw2-b-eqiad.mgmt.eqiad.wmnet - Access port speed <= 100Mbps   - https://alerts.wikimedia.org/?q=alertname%3DAccess+port+speed+%3C%3D+100Mbps
[02:32:03] <wikibugs>	 (PS1) Andrew Bogott: mwopenstackclients: replace use of os_client_config with openstack.config [puppet] - https://gerrit.wikimedia.org/r/919471 (https://phabricator.wikimedia.org/T336104)
[03:39:03] <jinxer-wm>	 (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:44:03] <jinxer-wm>	 (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:35:00] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for testvm2005:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[04:49:53] <icinga-wm>	 PROBLEM - SSH on bast4004 is CRITICAL: Server answer: Exceeded MaxStartups https://wikitech.wikimedia.org/wiki/SSH/monitoring
[04:51:29] <icinga-wm>	 RECOVERY - SSH on bast4004 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[05:36:23] <_joe_>	 !log building bookworm image for the first time T335560
[05:36:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:36:28] <stashbot>	 T335560: Publish Wikimedia bookworm base Docker image - https://phabricator.wikimedia.org/T335560
[05:42:45] <icinga-wm>	 PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-api-ext_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:50:43] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[05:53:37] <icinga-wm>	 RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:56:33] <icinga-wm>	 RECOVERY - snapshot of s3 in codfw on backupmon1001 is OK: Last snapshot for s3 at codfw (db2139) taken on 2023-05-15 02:31:57 (1289 GiB, +0.1 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[06:01:15] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[06:13:01] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for bast2003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[06:14:31] <icinga-wm>	 PROBLEM - SSH on bast4004 is CRITICAL: Server answer: Exceeded MaxStartups https://wikitech.wikimedia.org/wiki/SSH/monitoring
[06:16:05] <icinga-wm>	 RECOVERY - SSH on bast4004 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[06:17:38] <wikibugs>	 SRE, DC-Ops, Infrastructure-Foundations, netops: Access port speed <= 100Mbps False positives - https://phabricator.wikimedia.org/T336511 (ayounsi)
[06:18:24] <wikibugs>	 SRE, DC-Ops, Infrastructure-Foundations, netops: Access port speed <= 100Mbps False positives - https://phabricator.wikimedia.org/T336511 (ayounsi) I muted the alert for now until we can get to the bottom of it as it was spamming too much.
[06:19:48] <wikibugs>	 (PS1) Giuseppe Lavagetto: docker::baseimages: support bookworm [puppet] - https://gerrit.wikimedia.org/r/919611 (https://phabricator.wikimedia.org/T335560)
[06:28:35] <jinxer-wm>	 (Access port speed <= 100Mbps) firing: Alert for device asw2-b-eqiad.mgmt.eqiad.wmnet - Access port speed <= 100Mbps   - https://alerts.wikimedia.org/?q=alertname%3DAccess+port+speed+%3C%3D+100Mbps
[06:33:34] <jinxer-wm>	 (Access port speed <= 100Mbps) resolved: Alert for device asw2-b-eqiad.mgmt.eqiad.wmnet - Access port speed <= 100Mbps   - https://alerts.wikimedia.org/?q=alertname%3DAccess+port+speed+%3C%3D+100Mbps
[06:38:46] <wikibugs>	 (CR) Giuseppe Lavagetto: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41177/console" [puppet] - https://gerrit.wikimedia.org/r/919611 (https://phabricator.wikimedia.org/T335560) (owner: Giuseppe Lavagetto)
[07:00:05] <jouncebot>	 Amir1, Urbanecm, and taavi: Time to snap out of that daydream and deploy UTC morning backport window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230515T0700).
[07:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[07:05:12] <wikibugs>	 SRE, ops-codfw, cloud-services-team (FY2022/2023-Q4): cloudcontrol2001-dev: make it a cloudlb backend - https://phabricator.wikimedia.org/T336236 (jcrespo) hi, cloudcontrol2001-dev is failing to do all its backups. Usually this is due to maintenance or a defect on setup:   ` root@backup1001:~$ check_...
[07:05:15] <wikibugs>	 SRE, ops-codfw, cloud-services-team (FY2022/2023-Q4): cloudcontrol2001-dev: make it a cloudlb backend - https://phabricator.wikimedia.org/T336236 (jcrespo) I can file this on a separate ticket, if needed.
[07:08:44] <wikibugs>	 (PS4) Elukey: service::catalog: set lvs_setup for k8s-ingress-ml-staging [puppet] - https://gerrit.wikimedia.org/r/918409 (https://phabricator.wikimedia.org/T335756)
[07:08:58] <wikibugs>	 SRE, ops-codfw, cloud-services-team (FY2022/2023-Q4): cloudcontrol2001-dev: make it a cloudlb backend - https://phabricator.wikimedia.org/T336236 (jcrespo) I see what is going on- the backups are happening, but they return empty- which is a weird setup and we interpret as a failure (not intended). We...
[07:10:37] <wikibugs>	 (CR) Elukey: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41178/console" [puppet] - https://gerrit.wikimedia.org/r/918409 (https://phabricator.wikimedia.org/T335756) (owner: Elukey)
[07:13:48] <wikibugs>	 (PS1) Jcrespo: bacula: Ignore cloudcontrol2001-dev backup errors (no file backed up) [puppet] - https://gerrit.wikimedia.org/r/919781 (https://phabricator.wikimedia.org/T336236)
[07:16:22] <wikibugs>	 (CR) Ayounsi: "Would it make sens to just ignore all the dev/test hosts from backups?" [puppet] - https://gerrit.wikimedia.org/r/919781 (https://phabricator.wikimedia.org/T336236) (owner: Jcrespo)
[07:19:05] <wikibugs>	 (CR) Jelto: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/918594 (https://phabricator.wikimedia.org/T336301) (owner: Dzahn)
[07:20:25] <wikibugs>	 (CR) Jcrespo: bacula: Ignore cloudcontrol2001-dev backup errors (no file backed up) (1 comment) [puppet] - https://gerrit.wikimedia.org/r/919781 (https://phabricator.wikimedia.org/T336236) (owner: Jcrespo)
[07:23:18] <wikibugs>	 (CR) Jelto: [C: +2] microsites: change rewrite rule for https://transparency.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/918594 (https://phabricator.wikimedia.org/T336301) (owner: Dzahn)
[07:35:43] <wikibugs>	 (PS1) Volans: CHANGELOG: add changelogs for release v7.1.0 [software/spicerack] - https://gerrit.wikimedia.org/r/919783
[07:39:48] <wikibugs>	 (PS1) Elukey: Add discovery settings for k8s-ingress-ml-staging [dns] - https://gerrit.wikimedia.org/r/919785 (https://phabricator.wikimedia.org/T335756)
[07:40:34] <wikibugs>	 (CR) CI reject: [V: -1] Add discovery settings for k8s-ingress-ml-staging [dns] - https://gerrit.wikimedia.org/r/919785 (https://phabricator.wikimedia.org/T335756) (owner: Elukey)
[07:40:45] <wikibugs>	 (PS4) David Caro: d/changelog: prepare release 0.98 [software/tools-webservice] - https://gerrit.wikimedia.org/r/919356 (https://phabricator.wikimedia.org/T336507)
[07:40:49] <wikibugs>	 (PS1) David Caro: Fix tests to adapt the latest toolforge-weld [software/tools-webservice] - https://gerrit.wikimedia.org/r/919786
[07:40:55] <wikibugs>	 (PS1) David Caro: utils: allow specifying the version for bump_version [software/tools-webservice] - https://gerrit.wikimedia.org/r/919787
[07:41:29] <wikibugs>	 (CR) David Caro: d/changelog: prepare release 0.98 (1 comment) [software/tools-webservice] - https://gerrit.wikimedia.org/r/919356 (https://phabricator.wikimedia.org/T336507) (owner: David Caro)
[07:42:04] <wikibugs>	 (CR) Volans: [C: +2] CHANGELOG: add changelogs for release v7.1.0 [software/spicerack] - https://gerrit.wikimedia.org/r/919783 (owner: Volans)
[07:42:47] <wikibugs>	 (PS2) Slyngshede: Search: add function for search users. [software/bitu-ldap] - https://gerrit.wikimedia.org/r/919073 (https://phabricator.wikimedia.org/T335476)
[07:43:52] <wikibugs>	 (CR) CI reject: [V: -1] Search: add function for search users. [software/bitu-ldap] - https://gerrit.wikimedia.org/r/919073 (https://phabricator.wikimedia.org/T335476) (owner: Slyngshede)
[07:47:12] <wikibugs>	 (Merged) jenkins-bot: CHANGELOG: add changelogs for release v7.1.0 [software/spicerack] - https://gerrit.wikimedia.org/r/919783 (owner: Volans)
[07:49:49] <wikibugs>	 (PS1) Volans: Upstream release v7.1.0 [software/spicerack] (debian) - https://gerrit.wikimedia.org/r/919789
[07:49:57] <wikibugs>	 (PS3) Slyngshede: Search: add function for search users. [software/bitu-ldap] - https://gerrit.wikimedia.org/r/919073 (https://phabricator.wikimedia.org/T335476)
[07:51:25] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] webperf: Fix excimer_mysql_user typo [puppet] - https://gerrit.wikimedia.org/r/919422 (https://phabricator.wikimedia.org/T291015) (owner: Krinkle)
[07:54:29] <wikibugs>	 (PS1) Stevemunene: Add stat1009 dummykeytabs [labs/private] - https://gerrit.wikimedia.org/r/919791 (https://phabricator.wikimedia.org/T336036)
[07:54:33] <wikibugs>	 SRE, WMF-Legal, serviceops-collab, wikimediafoundation.org: Update redirect for transparency.wikimedia.org - https://phabricator.wikimedia.org/T336301 (Jelto) In progress→Resolved a:Dzahn  Change merged, redirect points to https://wikimediafoundation.org/about/transparency/current/ no...
[07:57:04] <wikibugs>	 (CR) Volans: [C: +2] Upstream release v7.1.0 [software/spicerack] (debian) - https://gerrit.wikimedia.org/r/919789 (owner: Volans)
[08:02:14] <wikibugs>	 (Merged) jenkins-bot: Upstream release v7.1.0 [software/spicerack] (debian) - https://gerrit.wikimedia.org/r/919789 (owner: Volans)
[08:04:16] <wikibugs>	 SRE, ops-codfw, DBA, database-backups: db2139 s4 (commonswiki) instance crashed (backup source) - https://phabricator.wikimedia.org/T335396 (jcrespo) Open→Resolved I've put db2139 back into service, no errors or issues observed so far. I reloaded all data from the most recent backup. Than...
[08:08:14] <volans>	 !log uploaded spicerack_7.1.0 to apt.wikimedia.org bullseye-wikimedia
[08:08:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:09:33] <wikibugs>	 (CR) Stevemunene: [C: +2] Add stat1009 dummykeytabs [labs/private] - https://gerrit.wikimedia.org/r/919791 (https://phabricator.wikimedia.org/T336036) (owner: Stevemunene)
[08:09:44] <wikibugs>	 (PS1) Volans: sre.hosts.provision: adapt call to DHCPConfMgmt [cookbooks] - https://gerrit.wikimedia.org/r/919792
[08:09:46] <wikibugs>	 (PS1) Volans: sre.network.provision: add new cookbook [cookbooks] - https://gerrit.wikimedia.org/r/919793 (https://phabricator.wikimedia.org/T336485)
[08:09:52] <wikibugs>	 (CR) Stevemunene: [V: +2 C: +2] Add stat1009 dummykeytabs [labs/private] - https://gerrit.wikimedia.org/r/919791 (https://phabricator.wikimedia.org/T336036) (owner: Stevemunene)
[08:11:30] <wikibugs>	 SRE, SRE-tools, Infrastructure-Foundations, netops, Patch-For-Review: Setup zero touch provisioning (ZTP) for network devices - https://phabricator.wikimedia.org/T336485 (Volans)
[08:12:33] <wikibugs>	 (CR) CI reject: [V: -1] sre.network.provision: add new cookbook [cookbooks] - https://gerrit.wikimedia.org/r/919793 (https://phabricator.wikimedia.org/T336485) (owner: Volans)
[08:16:12] <wikibugs>	 (PS2) Volans: sre.network.provision: add new cookbook [cookbooks] - https://gerrit.wikimedia.org/r/919793 (https://phabricator.wikimedia.org/T336485)
[08:21:32] <wikibugs>	 (CR) Ayounsi: [C: +1] sre.hosts.provision: adapt call to DHCPConfMgmt [cookbooks] - https://gerrit.wikimedia.org/r/919792 (owner: Volans)
[08:22:51] <volans>	 !log installed spicerack_7.1.0 on cumin2002
[08:22:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:26:00] <wikibugs>	 (CR) Elukey: [V: +1 C: +2] service::catalog: set lvs_setup for k8s-ingress-ml-staging [puppet] - https://gerrit.wikimedia.org/r/918409 (https://phabricator.wikimedia.org/T335756) (owner: Elukey)
[08:26:06] <volans>	 !log installed spicerack_7.1.0 on cumin1001
[08:26:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:26:19] <wikibugs>	 (CR) Volans: [C: +2] sre.hosts.provision: adapt call to DHCPConfMgmt [cookbooks] - https://gerrit.wikimedia.org/r/919792 (owner: Volans)
[08:26:44] <elukey>	 !log restart pybal on lvs2010 and lvs2009 to pick up new LVS VIP for ml-staging k8s ingress - T335756
[08:26:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:26:47] <stashbot>	 T335756: Create a staging ingress configuration for ml-staging-codfw - https://phabricator.wikimedia.org/T335756
[08:28:37] <wikibugs>	 (Merged) jenkins-bot: sre.hosts.provision: adapt call to DHCPConfMgmt [cookbooks] - https://gerrit.wikimedia.org/r/919792 (owner: Volans)
[08:30:00] <wikibugs>	 (PS5) Filippo Giunchedi: prometheus: don't fail on unknown blackbox probe type [puppet] - https://gerrit.wikimedia.org/r/919331 (https://phabricator.wikimedia.org/T320620)
[08:30:01] <elukey>	 pybal on 2010 looks good, proceeding with 2009
[08:32:44] <elukey>	 aand lvs2009 done
[08:35:00] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for testvm2005:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[08:37:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:39:26] <wikibugs>	 SRE, ops-codfw, Patch-For-Review, cloud-services-team (FY2022/2023-Q4): cloudcontrol2001-dev: make it a cloudlb backend - https://phabricator.wikimedia.org/T336236 (aborrero) >>! In T336236#8849618, @jcrespo wrote: > I see what is going on- the backups are happening, but they return empty- which...
[08:39:40] <wikibugs>	 (PS6) Filippo Giunchedi: prometheus: don't fail on unknown blackbox probe type [puppet] - https://gerrit.wikimedia.org/r/919331 (https://phabricator.wikimedia.org/T320620)
[08:40:21] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +1] bacula: Ignore cloudcontrol2001-dev backup errors (no file backed up) (1 comment) [puppet] - https://gerrit.wikimedia.org/r/919781 (https://phabricator.wikimedia.org/T336236) (owner: Jcrespo)
[08:42:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:42:58] <wikibugs>	 (PS1) Elukey: service::catalog: switch k8s-ingress-ml-staging to production [puppet] - https://gerrit.wikimedia.org/r/919795 (https://phabricator.wikimedia.org/T335756)
[08:45:12] <logmsgbot>	 !log jelto@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[08:45:25] <logmsgbot>	 !log jelto@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[08:47:13] <wikibugs>	 (CR) Jcrespo: [C: +2] bacula: Ignore cloudcontrol2001-dev backup errors (no file backed up) (1 comment) [puppet] - https://gerrit.wikimedia.org/r/919781 (https://phabricator.wikimedia.org/T336236) (owner: Jcrespo)
[08:47:22] <wikibugs>	 (CR) Filippo Giunchedi: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41180/console" [puppet] - https://gerrit.wikimedia.org/r/919331 (https://phabricator.wikimedia.org/T320620) (owner: Filippo Giunchedi)
[08:49:13] <wikibugs>	 (CR) Ayounsi: sre.network.provision: add new cookbook (5 comments) [cookbooks] - https://gerrit.wikimedia.org/r/919793 (https://phabricator.wikimedia.org/T336485) (owner: Volans)
[08:50:12] <wikibugs>	 (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41181/console" [puppet] - https://gerrit.wikimedia.org/r/919795 (https://phabricator.wikimedia.org/T335756) (owner: Elukey)
[08:53:03] <jinxer-wm>	 (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:54:12] <wikibugs>	 SRE, Data-Engineering, Data-Platform-SRE, LDAP-Access-Requests, and 2 others: Grant temporary access to web based Data Engineering tools to Bishop Fox - https://phabricator.wikimedia.org/T336357 (BTullis) I have now been provided with the following two Wikitech accounts for the two users:  * Thom...
[08:57:29] <wikibugs>	 (CR) David Caro: [C: +2] P:toolforge::proxy: uninstall toolsweblogster [puppet] - https://gerrit.wikimedia.org/r/917923 (owner: Majavah)
[08:58:03] <jinxer-wm>	 (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:01:12] <wikibugs>	 (CR) Klausman: [C: +1] service::catalog: switch k8s-ingress-ml-staging to production [puppet] - https://gerrit.wikimedia.org/r/919795 (https://phabricator.wikimedia.org/T335756) (owner: Elukey)
[09:05:03] <wikibugs>	 (CR) Filippo Giunchedi: [V: +1] prometheus: don't fail on unknown blackbox probe type (1 comment) [puppet] - https://gerrit.wikimedia.org/r/919331 (https://phabricator.wikimedia.org/T320620) (owner: Filippo Giunchedi)
[09:05:18] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/thumbor: sync
[09:06:32] <wikibugs>	 (CR) Elukey: [V: +1 C: +2] service::catalog: switch k8s-ingress-ml-staging to production [puppet] - https://gerrit.wikimedia.org/r/919795 (https://phabricator.wikimedia.org/T335756) (owner: Elukey)
[09:08:42] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/thumbor: sync
[09:11:20] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es1020.eqiad.wmnet with reason: Maintenance
[09:11:33] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es1020.eqiad.wmnet with reason: Maintenance
[09:11:39] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling es1020 (T335845)', diff saved to https://phabricator.wikimedia.org/P48217 and previous config saved to /var/cache/conftool/dbconfig/20230515-091139-ladsgroup.json
[09:12:35] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "LGTM!" [alerts] - https://gerrit.wikimedia.org/r/918547 (https://phabricator.wikimedia.org/T320620) (owner: Cwhite)
[09:12:57] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[09:13:03] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[09:14:23] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.284 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[09:14:29] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49993 bytes in 0.100 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[09:14:33] <icinga-wm>	 RECOVERY - Router interfaces on cr2-drmrs is OK: OK: host 185.15.58.129, interfaces up: 62, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:14:56] <wikibugs>	 (CR) Volans: "Thanks for the quick review, addressed comments." [cookbooks] - https://gerrit.wikimedia.org/r/919793 (https://phabricator.wikimedia.org/T336485) (owner: Volans)
[09:14:57] <icinga-wm>	 RECOVERY - Router interfaces on cr3-esams is OK: OK: host 91.198.174.245, interfaces up: 84, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:14:59] <wikibugs>	 SRE, Platform Engineering, Release Pipeline, serviceops, and 4 others: Kask functional testing with Cassandra via the Deployment Pipeline - https://phabricator.wikimedia.org/T224041 (hashar) I have filed T335780 with the goal of updating the `docker.wikimedia.org/releng/cassandra311` image which...
[09:16:07] <wikibugs>	 (PS3) Volans: sre.network.provision: add new cookbook [cookbooks] - https://gerrit.wikimedia.org/r/919793 (https://phabricator.wikimedia.org/T336485)
[09:25:00] <wikibugs>	 (CR) Elukey: "recheck" [dns] - https://gerrit.wikimedia.org/r/919785 (https://phabricator.wikimedia.org/T335756) (owner: Elukey)
[09:27:36] <wikibugs>	 SRE, Release-Engineering-Team, ci-test-error: tox-docker CI test doesn't pick up overrides for pylint - https://phabricator.wikimedia.org/T281347 (hashar) Open→Resolved After 2 years, I am assuming this one got solved since some change got merged, we upgraded pip/tox etc since then.  See also...
[09:28:11] <wikibugs>	 (CR) Ayounsi: [C: +1] "Not tested but overall lgtm." [cookbooks] - https://gerrit.wikimedia.org/r/919793 (https://phabricator.wikimedia.org/T336485) (owner: Volans)
[09:28:11] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1020 (T335845)', diff saved to https://phabricator.wikimedia.org/P48218 and previous config saved to /var/cache/conftool/dbconfig/20230515-092810-ladsgroup.json
[09:28:34] <jinxer-wm>	 (KubernetesAPILatency) firing: (2) High Kubernetes API latency (PUT endpoints) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:33:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (PUT endpoints) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:37:20] <wikibugs>	 SRE, Traffic, Wikidata, wdwb-tech, wmde-wikidata-tech: Wikidata seems to still be utilizing insecure HTTP URIs - https://phabricator.wikimedia.org/T331356 (Ladsgroup) Until it gets changed to HTTPS, basically we have two options:  - Remove the link from sidebar and add it as text field in act...
[09:38:54] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 15802
[09:39:37] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 15802
[09:39:54] <wikibugs>	 (PS1) Majavah: varnishkafka: remove absented logster integration [puppet] - https://gerrit.wikimedia.org/r/919802
[09:39:56] <wikibugs>	 (PS1) Majavah: P:toolforge::proxy: remove absented logster resources [puppet] - https://gerrit.wikimedia.org/r/919803
[09:39:58] <wikibugs>	 (PS1) Majavah: logster: remove classes [puppet] - https://gerrit.wikimedia.org/r/919804
[09:40:32] <wikibugs>	 SRE, DBA, Data-Platform-SRE, Infrastructure-Foundations, and 9 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (MatthewVernon)
[09:43:12] <wikibugs>	 (CR) Majavah: "PCC: https://puppet-compiler.wmflabs.org/output/919802/41182/, deployment-prep issues seem unrelated" [puppet] - https://gerrit.wikimedia.org/r/919802 (owner: Majavah)
[09:43:17] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1020', diff saved to https://phabricator.wikimedia.org/P48219 and previous config saved to /var/cache/conftool/dbconfig/20230515-094317-ladsgroup.json
[09:49:39] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depool db1123 T334910', diff saved to https://phabricator.wikimedia.org/P48220 and previous config saved to /var/cache/conftool/dbconfig/20230515-094938-ladsgroup.json
[09:49:43] <stashbot>	 T334910: decommission db1123.eqiad.wmnet - https://phabricator.wikimedia.org/T334910
[09:50:11] <icinga-wm>	 RECOVERY - Backup freshness on backup1001 is OK: Fresh: 124 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[09:51:56] <wikibugs>	 SRE, Traffic, serviceops, Platform Team Initiatives (API Gateway): Create Benthos image - https://phabricator.wikimedia.org/T336658 (kamila)
[09:52:02] <wikibugs>	 (PS1) Ladsgroup: conftool-data: Remove db1123 [puppet] - https://gerrit.wikimedia.org/r/919805 (https://phabricator.wikimedia.org/T334910)
[09:52:19] <wikibugs>	 SRE, Traffic, serviceops, Platform Team Initiatives (API Gateway): Create Benthos image - https://phabricator.wikimedia.org/T336658 (kamila) Open→In progress
[09:52:23] <wikibugs>	 SRE, Traffic, serviceops, Platform Team Initiatives (API Gateway): Handle edge cache invalidation for the api gateway - https://phabricator.wikimedia.org/T324200 (kamila)
[09:53:12] <wikibugs>	 (CR) Ladsgroup: [C: +2] conftool-data: Remove db1123 [puppet] - https://gerrit.wikimedia.org/r/919805 (https://phabricator.wikimedia.org/T334910) (owner: Ladsgroup)
[09:53:36] <wikibugs>	 SRE, Traffic, serviceops, Platform Team Initiatives (API Gateway): Create Benthos docker image - https://phabricator.wikimedia.org/T336658 (kamila)
[09:54:12] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Remove db1123 from dbctl T334910', diff saved to https://phabricator.wikimedia.org/P48221 and previous config saved to /var/cache/conftool/dbconfig/20230515-095412-ladsgroup.json
[09:56:24] <wikibugs>	 (PS1) Ladsgroup: Remove db1123 [puppet] - https://gerrit.wikimedia.org/r/919807 (https://phabricator.wikimedia.org/T334910)
[09:58:23] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1020', diff saved to https://phabricator.wikimedia.org/P48222 and previous config saved to /var/cache/conftool/dbconfig/20230515-095823-ladsgroup.json
[10:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230515T1000)
[10:01:16] <wikibugs>	 (PS2) Elukey: Add discovery settings for k8s-ingress-ml-staging [dns] - https://gerrit.wikimedia.org/r/919785 (https://phabricator.wikimedia.org/T335756)
[10:02:12] <wikibugs>	 (PS1) Hnowlan: admin_ng, thumbor: double memory limit for namespace and pods [deployment-charts] - https://gerrit.wikimedia.org/r/919808 (https://phabricator.wikimedia.org/T334488)
[10:02:48] <wikibugs>	 (CR) Filippo Giunchedi: "Untested but overall LGTM, see inline" [puppet] - https://gerrit.wikimedia.org/r/919053 (https://phabricator.wikimedia.org/T271822) (owner: Giuseppe Lavagetto)
[10:05:49] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.decommission for hosts db1123.eqiad.wmnet
[10:11:01] <icinga-wm>	 PROBLEM - Check systemd state on db2184 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-mysqld-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:11:03] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.dns.netbox
[10:13:01] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for bast2003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[10:13:08] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1123.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - ladsgroup@cumin1001"
[10:13:29] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1020 (T335845)', diff saved to https://phabricator.wikimedia.org/P48223 and previous config saved to /var/cache/conftool/dbconfig/20230515-101329-ladsgroup.json
[10:14:19] <wikibugs>	 (CR) Kamila Součková: [C: +1] admin_ng, thumbor: double memory limit for namespace and pods [deployment-charts] - https://gerrit.wikimedia.org/r/919808 (https://phabricator.wikimedia.org/T334488) (owner: Hnowlan)
[10:15:28] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1123.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - ladsgroup@cumin1001"
[10:15:28] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[10:15:29] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db1123.eqiad.wmnet
[10:18:43] <wikibugs>	 (CR) Ladsgroup: [C: +2] Remove db1123 [puppet] - https://gerrit.wikimedia.org/r/919807 (https://phabricator.wikimedia.org/T334910) (owner: Ladsgroup)
[10:19:34] <Amir1>	 !log Removing db1123 from zarcillo T334910
[10:19:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:19:38] <stashbot>	 T334910: decommission db1123.eqiad.wmnet - https://phabricator.wikimedia.org/T334910
[10:20:19] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es1023.eqiad.wmnet with reason: Maintenance
[10:20:32] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es1023.eqiad.wmnet with reason: Maintenance
[10:20:39] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling es1023 (T335845)', diff saved to https://phabricator.wikimedia.org/P48224 and previous config saved to /var/cache/conftool/dbconfig/20230515-102038-ladsgroup.json
[10:21:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:24:52] <wikibugs>	 (CR) Ayounsi: Validators: improve device name, add interface/outlet (3 comments) [software/netbox-extras] - https://gerrit.wikimedia.org/r/917876 (https://phabricator.wikimedia.org/T310590) (owner: Ayounsi)
[10:25:19] <wikibugs>	 (PS5) Ayounsi: Validators: improve device name, add interface/outlet [software/netbox-extras] - https://gerrit.wikimedia.org/r/917876 (https://phabricator.wikimedia.org/T310590)
[10:26:03] <jinxer-wm>	 (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:29:18] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:31:06] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1023 (T335845)', diff saved to https://phabricator.wikimedia.org/P48225 and previous config saved to /var/cache/conftool/dbconfig/20230515-103105-ladsgroup.json
[10:31:26] <wikibugs>	 (CR) JMeybohm: prometheus/k8s: add selective scraping of ports (1 comment) [puppet] - https://gerrit.wikimedia.org/r/919053 (https://phabricator.wikimedia.org/T271822) (owner: Giuseppe Lavagetto)
[10:34:18] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:36:02] <wikibugs>	 (CR) JMeybohm: prometheus/k8s: add selective scraping of ports (1 comment) [puppet] - https://gerrit.wikimedia.org/r/919053 (https://phabricator.wikimedia.org/T271822) (owner: Giuseppe Lavagetto)
[10:39:52] <wikibugs>	 ops-eqiad, DBA, decommission-hardware: decommission db1123.eqiad.wmnet - https://phabricator.wikimedia.org/T334910 (Ladsgroup) a:Ladsgroup→wiki_willy
[10:46:12] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1023', diff saved to https://phabricator.wikimedia.org/P48226 and previous config saved to /var/cache/conftool/dbconfig/20230515-104611-ladsgroup.json
[10:56:41] <wikibugs>	 (PS2) Giuseppe Lavagetto: prometheus/k8s: add selective scraping of ports in staging [puppet] - https://gerrit.wikimedia.org/r/919053 (https://phabricator.wikimedia.org/T271822)
[10:56:53] <wikibugs>	 (CR) Giuseppe Lavagetto: prometheus/k8s: add selective scraping of ports in staging (3 comments) [puppet] - https://gerrit.wikimedia.org/r/919053 (https://phabricator.wikimedia.org/T271822) (owner: Giuseppe Lavagetto)
[10:58:56] <wikibugs>	 (CR) CI reject: [V: -1] prometheus/k8s: add selective scraping of ports in staging [puppet] - https://gerrit.wikimedia.org/r/919053 (https://phabricator.wikimedia.org/T271822) (owner: Giuseppe Lavagetto)
[11:01:05] <wikibugs>	 (PS6) JMeybohm: Run envoy config validation against scaffold [deployment-charts] - https://gerrit.wikimedia.org/r/915593 (https://phabricator.wikimedia.org/T300324)
[11:01:15] <wikibugs>	 (PS9) JMeybohm: Refactor envoy access_log_path to access loggers [deployment-charts] - https://gerrit.wikimedia.org/r/915458 (https://phabricator.wikimedia.org/T300324)
[11:01:18] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1023', diff saved to https://phabricator.wikimedia.org/P48227 and previous config saved to /var/cache/conftool/dbconfig/20230515-110118-ladsgroup.json
[11:01:20] <wikibugs>	 (PS5) JMeybohm: mesh/configuration: Refactor common_http_protocol_options [deployment-charts] - https://gerrit.wikimedia.org/r/915672 (https://phabricator.wikimedia.org/T300324)
[11:02:28] <wikibugs>	 (PS3) Giuseppe Lavagetto: prometheus/k8s: add selective scraping of ports in staging [puppet] - https://gerrit.wikimedia.org/r/919053 (https://phabricator.wikimedia.org/T271822)
[11:03:19] <wikibugs>	 (CR) JMeybohm: [C: +2] Allow basic validation of envoy config in CI [puppet] - https://gerrit.wikimedia.org/r/915745 (https://phabricator.wikimedia.org/T304660) (owner: JMeybohm)
[11:03:21] <wikibugs>	 (CR) JMeybohm: [C: +2] envoyproxy: Fix most validation errors in the `good` build_envoy_config tests [puppet] - https://gerrit.wikimedia.org/r/773642 (https://phabricator.wikimedia.org/T304660) (owner: RLazarus)
[11:16:18] <wikibugs>	 (PS1) Btullis: Add two ldap_only users from bishopfox [puppet] - https://gerrit.wikimedia.org/r/919822 (https://phabricator.wikimedia.org/T336357)
[11:16:24] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1023 (T335845)', diff saved to https://phabricator.wikimedia.org/P48228 and previous config saved to /var/cache/conftool/dbconfig/20230515-111624-ladsgroup.json
[11:19:04] <wikibugs>	 (CR) Btullis: "This is to replace: https://gerrit.wikimedia.org/r/c/operations/puppet/+/918519" [puppet] - https://gerrit.wikimedia.org/r/919822 (https://phabricator.wikimedia.org/T336357) (owner: Btullis)
[11:24:52] <wikibugs>	 (CR) Majavah: [C: +1] "see inline" [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/918610 (https://phabricator.wikimedia.org/T258841) (owner: BryanDavis)
[11:25:11] <wikibugs>	 (Abandoned) Majavah: P:base::firewall: add non-etcd way to reject traffic [puppet] - https://gerrit.wikimedia.org/r/918408 (owner: Majavah)
[11:26:58] <wikibugs>	 (PS1) Stevemunene: Bring stat1009 into service [puppet] - https://gerrit.wikimedia.org/r/919826 (https://phabricator.wikimedia.org/T336036)
[11:43:45] <icinga-wm>	 RECOVERY - Checks that the airflow database for airflow analytics_product is working properly on an-airflow1006 is OK: OK: /usr/bin/env AIRFLOW_HOME=/srv/airflow-analytics_product /usr/lib/airflow/bin/airflow db check succeeded https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[11:44:13] <icinga-wm>	 RECOVERY - Checks that the local airflow scheduler for airflow @analytics_product is working properly on an-airflow1006 is OK: OK: /usr/bin/env AIRFLOW_HOME=/srv/airflow-analytics_product /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1006.eqiad.wmnet succeeded https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[11:45:45] <wikibugs>	 (CR) Volans: "replies inline" [software/netbox-extras] - https://gerrit.wikimedia.org/r/917876 (https://phabricator.wikimedia.org/T310590) (owner: Ayounsi)
[12:13:57] <wikibugs>	 (CR) Cathal Mooney: [C: +1] "LGTM Filippo thanks for the help here!  I'm happy to roll it out with current warning level and see how we get on?" [alerts] - https://gerrit.wikimedia.org/r/915489 (https://phabricator.wikimedia.org/T335350) (owner: Cathal Mooney)
[12:16:23] <wikibugs>	 (PS1) Jaime Nuche: doc: add rsync password [labs/private] - https://gerrit.wikimedia.org/r/919833 (https://phabricator.wikimedia.org/T336168)
[12:16:48] <wikibugs>	 (PS6) Ayounsi: Validators: improve device name, add interface/outlet [software/netbox-extras] - https://gerrit.wikimedia.org/r/917876 (https://phabricator.wikimedia.org/T310590)
[12:16:50] <wikibugs>	 (PS1) Jaime Nuche: doc: temporary config for docs publishing from gitlab [puppet] - https://gerrit.wikimedia.org/r/919834 (https://phabricator.wikimedia.org/T336168)
[12:18:31] <wikibugs>	 (CR) Ayounsi: Validators: improve device name, add interface/outlet (2 comments) [software/netbox-extras] - https://gerrit.wikimedia.org/r/917876 (https://phabricator.wikimedia.org/T310590) (owner: Ayounsi)
[12:19:49] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "LGTM! Verification can happen at https://prometheus-eqiad.wikimedia.org/k8s-staging/" [puppet] - https://gerrit.wikimedia.org/r/919053 (https://phabricator.wikimedia.org/T271822) (owner: Giuseppe Lavagetto)
[12:23:05] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "Agreed, let's ship it!" [alerts] - https://gerrit.wikimedia.org/r/915489 (https://phabricator.wikimedia.org/T335350) (owner: Cathal Mooney)
[12:27:53] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 15802
[12:28:46] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 15802
[12:35:01] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for testvm2005:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[12:35:15] <wikibugs>	 SRE, DBA, Data-Platform-SRE, Infrastructure-Foundations, and 9 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (MatthewVernon)
[12:35:50] <wikibugs>	 (PS2) Jgreen: Add frav1003 dns and rdns entries [dns] - https://gerrit.wikimedia.org/r/919239 (https://phabricator.wikimedia.org/T334400) (owner: Dwisehaupt)
[12:39:50] <wikibugs>	 (CR) Jgreen: [C: +2] Add frav1003 dns and rdns entries [dns] - https://gerrit.wikimedia.org/r/919239 (https://phabricator.wikimedia.org/T334400) (owner: Dwisehaupt)
[12:40:07] <wikibugs>	 SRE, Data-Engineering, Shared-Data-Infrastructure, User-MoritzMuehlenhoff: Hadoop MapReduce port range cannot be configured to a fixed range - https://phabricator.wikimedia.org/T111433 (Ottomata) This is probably Data Eng SRE responsibility!  Tagged.
[12:40:13] <wikibugs>	 SRE, Data-Engineering, Shared-Data-Infrastructure, User-MoritzMuehlenhoff: Hadoop MapReduce port range cannot be configured to a fixed range - https://phabricator.wikimedia.org/T111433 (Ottomata)
[12:41:11] <wikibugs>	 (CR) Jgreen: [C: +2] Add donorpreferences, delete dash CNAMEs [dns] - https://gerrit.wikimedia.org/r/919423 (https://phabricator.wikimedia.org/T335793) (owner: Dwisehaupt)
[12:41:20] <wikibugs>	 (PS2) Jgreen: Add donorpreferences, delete dash CNAMEs [dns] - https://gerrit.wikimedia.org/r/919423 (https://phabricator.wikimedia.org/T335793) (owner: Dwisehaupt)
[12:41:30] <wikibugs>	 (CR) Jgreen: [V: +2] Add donorpreferences, delete dash CNAMEs [dns] - https://gerrit.wikimedia.org/r/919423 (https://phabricator.wikimedia.org/T335793) (owner: Dwisehaupt)
[12:41:45] <wikibugs>	 SRE, DBA, Data-Platform-SRE, Infrastructure-Foundations, and 9 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (Jelto)
[12:42:32] <wikibugs>	 (CR) Jgreen: [C: +2] Add donorpreferences, delete dash CNAMEs [dns] - https://gerrit.wikimedia.org/r/919423 (https://phabricator.wikimedia.org/T335793) (owner: Dwisehaupt)
[12:49:40] <wikibugs>	 SRE, Data-Engineering, Data-Platform-SRE, LDAP-Access-Requests, and 4 others: Grant temporary access to web based Data Engineering tools to Bishop Fox - https://phabricator.wikimedia.org/T336357 (JArguello-WMF)
[12:51:13] <wikibugs>	 SRE, Traffic, serviceops, Platform Team Initiatives (API Gateway): Handle edge cache invalidation for the api gateway - https://phabricator.wikimedia.org/T324200 (Ottomata) > Basically add a configuration that allows to figure out article_url  => derived urls as a mapping, and eliminate all the n...
[12:53:06] <jinxer-wm>	 (CirrusSearchJobQueueBacklogTooBig) firing: CirrusSearch job topic eqiad.mediawiki.job.cirrusSearchLinksUpdate is heavily backlogged with 242k messages - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=eqiad%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueBacklogTooBig
[12:55:01] <wikibugs>	 (CR) JMeybohm: [C: +2] mesh/configuration: Refactor common_http_protocol_options [deployment-charts] - https://gerrit.wikimedia.org/r/915672 (https://phabricator.wikimedia.org/T300324) (owner: JMeybohm)
[12:55:03] <wikibugs>	 (CR) JMeybohm: [C: +2] Refactor envoy access_log_path to access loggers [deployment-charts] - https://gerrit.wikimedia.org/r/915458 (https://phabricator.wikimedia.org/T300324) (owner: JMeybohm)
[12:55:06] <wikibugs>	 (CR) JMeybohm: [C: +2] Run envoy config validation against scaffold [deployment-charts] - https://gerrit.wikimedia.org/r/915593 (https://phabricator.wikimedia.org/T300324) (owner: JMeybohm)
[12:55:48] <wikibugs>	 (Merged) jenkins-bot: Run envoy config validation against scaffold [deployment-charts] - https://gerrit.wikimedia.org/r/915593 (https://phabricator.wikimedia.org/T300324) (owner: JMeybohm)
[12:55:50] <wikibugs>	 (Merged) jenkins-bot: Refactor envoy access_log_path to access loggers [deployment-charts] - https://gerrit.wikimedia.org/r/915458 (https://phabricator.wikimedia.org/T300324) (owner: JMeybohm)
[12:55:53] <wikibugs>	 (Merged) jenkins-bot: mesh/configuration: Refactor common_http_protocol_options [deployment-charts] - https://gerrit.wikimedia.org/r/915672 (https://phabricator.wikimedia.org/T300324) (owner: JMeybohm)
[12:58:23] <logmsgbot>	 !log jelto@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'.
[12:58:54] <logmsgbot>	 !log jelto@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[12:59:19] <wikibugs>	 (PS1) Daimona Eaytoy: Configure logging for the CampaignEvents channel [mediawiki-config] - https://gerrit.wikimedia.org/r/919838 (https://phabricator.wikimedia.org/T320434)
[13:00:04] <logmsgbot>	 !log jelto@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'.
[13:00:05] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: #bothumor My software never has bugs. It just develops random features. Rise for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230515T1300).
[13:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[13:00:31] <logmsgbot>	 !log jelto@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[13:03:06] <jinxer-wm>	 (CirrusSearchJobQueueBacklogTooBig) resolved: CirrusSearch job topic eqiad.mediawiki.job.cirrusSearchLinksUpdate is heavily backlogged with 201.9k messages - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=eqiad%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueBacklogTooBig
[13:07:17] <wikibugs>	 (CR) Ottomata: Add flink-app default log config and use it in page_content_change (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/917999 (https://phabricator.wikimedia.org/T335802) (owner: TChin)
[13:07:49] <wikibugs>	 (PS3) Elukey: Add discovery settings for k8s-ingress-ml-staging [dns] - https://gerrit.wikimedia.org/r/919785 (https://phabricator.wikimedia.org/T335756)
[13:09:31] <wikibugs>	 (CR) JMeybohm: [C: +1] Add discovery settings for k8s-ingress-ml-staging [dns] - https://gerrit.wikimedia.org/r/919785 (https://phabricator.wikimedia.org/T335756) (owner: Elukey)
[13:10:07] <wikibugs>	 (CR) Elukey: [C: +2] Add discovery settings for k8s-ingress-ml-staging [dns] - https://gerrit.wikimedia.org/r/919785 (https://phabricator.wikimedia.org/T335756) (owner: Elukey)
[13:22:22] <wikibugs>	 (CR) David Caro: [C: +2] Fix tests to adapt the latest toolforge-weld [software/tools-webservice] - https://gerrit.wikimedia.org/r/919786 (owner: David Caro)
[13:22:26] <wikibugs>	 (CR) David Caro: [C: +2] utils: allow specifying the version for bump_version [software/tools-webservice] - https://gerrit.wikimedia.org/r/919787 (owner: David Caro)
[13:22:30] <wikibugs>	 (CR) David Caro: [C: +2] d/changelog: prepare release 0.98 [software/tools-webservice] - https://gerrit.wikimedia.org/r/919356 (https://phabricator.wikimedia.org/T336507) (owner: David Caro)
[13:23:23] <wikibugs>	 (Merged) jenkins-bot: Fix tests to adapt the latest toolforge-weld [software/tools-webservice] - https://gerrit.wikimedia.org/r/919786 (owner: David Caro)
[13:23:25] <wikibugs>	 (Merged) jenkins-bot: utils: allow specifying the version for bump_version [software/tools-webservice] - https://gerrit.wikimedia.org/r/919787 (owner: David Caro)
[13:23:54] <wikibugs>	 (PS2) Majavah: ferm::service: allow passing array of hosts [puppet] - https://gerrit.wikimedia.org/r/919300
[13:24:11] <wikibugs>	 (Merged) jenkins-bot: d/changelog: prepare release 0.98 [software/tools-webservice] - https://gerrit.wikimedia.org/r/919356 (https://phabricator.wikimedia.org/T336507) (owner: David Caro)
[13:24:30] <wikibugs>	 (CR) CI reject: [V: -1] ferm::service: allow passing array of hosts [puppet] - https://gerrit.wikimedia.org/r/919300 (owner: Majavah)
[13:24:41] <wikibugs>	 SRE, Infrastructure-Foundations, Traffic: Reimaging cookbook not forcing a Puppet agent run on lvs2011, lvs2012 - https://phabricator.wikimedia.org/T336593 (Volans) @ssingh I'm not sure I agree with your renaming. What do you mean exactly? :)  The only race I can think of is that the removal from pup...
[13:25:30] <wikibugs>	 (PS3) Majavah: ferm::service: allow passing array of hosts [puppet] - https://gerrit.wikimedia.org/r/919300
[13:26:59] <wikibugs>	 (CR) Majavah: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41185/console" [puppet] - https://gerrit.wikimedia.org/r/919300 (owner: Majavah)
[13:28:21] <wikibugs>	 SRE, ops-eqiad, DBA, DC-Ops, Data-Persistence-Backup: db1225 crashed (CPU 1 machine check error detected) - https://phabricator.wikimedia.org/T336326 (Jclark-ctr) Open→Resolved Server has not had any errors since flea power drain  closing ticket
[13:32:14] <wikibugs>	 (CR) Volans: [C: +2] installserver: enable ZTP for network devices [puppet] - https://gerrit.wikimedia.org/r/919076 (https://phabricator.wikimedia.org/T336485) (owner: Volans)
[13:32:32] <wikibugs>	 (CR) Volans: [C: +2] install_server: simplify DHCP config [puppet] - https://gerrit.wikimedia.org/r/919276 (https://phabricator.wikimedia.org/T336485) (owner: Volans)
[13:32:43] <wikibugs>	 (PS8) Volans: install_server: simplify DHCP config [puppet] - https://gerrit.wikimedia.org/r/919276 (https://phabricator.wikimedia.org/T336485)
[13:33:24] <volans>	 !log disabling puppet on the install hosts to deploy changes for T336485
[13:33:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:33:29] <stashbot>	 T336485: Setup zero touch provisioning (ZTP) for network devices - https://phabricator.wikimedia.org/T336485
[13:34:32] <wikibugs>	 (PS14) Volans: install_server: convert dhcpd.conf to template [puppet] - https://gerrit.wikimedia.org/r/919277 (https://phabricator.wikimedia.org/T336485)
[13:36:08] <wikibugs>	 (CR) Volans: [C: +2] install_server: convert dhcpd.conf to template [puppet] - https://gerrit.wikimedia.org/r/919277 (https://phabricator.wikimedia.org/T336485) (owner: Volans)
[13:36:26] <wikibugs>	 (PS6) Volans: install_server: remove mgmt subnet already managed [puppet] - https://gerrit.wikimedia.org/r/919282 (https://phabricator.wikimedia.org/T336485)
[13:37:40] <wikibugs>	 (CR) Volans: [C: +2] install_server: remove mgmt subnet already managed [puppet] - https://gerrit.wikimedia.org/r/919282 (https://phabricator.wikimedia.org/T336485) (owner: Volans)
[13:38:58] <wikibugs>	 SRE, Infrastructure-Foundations, Traffic: Reimaging cookbook not forcing a Puppet agent run on lvs2011, lvs2012 - https://phabricator.wikimedia.org/T336593 (ssingh) >>! In T336593#8850410, @Volans wrote: > @ssingh I'm not sure I agree with your renaming. What do you mean exactly? :)  "Reimaging cookb...
[13:42:09] <wikibugs>	 (PS1) Klausman: hiera/k8s/ml: add namespace permissions for revertrisk [puppet] - https://gerrit.wikimedia.org/r/919842 (https://phabricator.wikimedia.org/T333124)
[13:42:50] <wikibugs>	 SRE, Data-Engineering, Data-Platform-SRE, Shared-Data-Infrastructure, User-MoritzMuehlenhoff: Hadoop MapReduce port range cannot be configured to a fixed range - https://phabricator.wikimedia.org/T111433 (BTullis) a:BTullis I'll have a look at this one.
[13:43:28] <wikibugs>	 (PS1) Klausman: k8s/ml/prod: Add revertrisk namespace and permissions, plus TLS config [deployment-charts] - https://gerrit.wikimedia.org/r/919843 (https://phabricator.wikimedia.org/T333124)
[13:45:41] <wikibugs>	 (CR) Btullis: [C: +2] Add two ldap_only users from bishopfox [puppet] - https://gerrit.wikimedia.org/r/919822 (https://phabricator.wikimedia.org/T336357) (owner: Btullis)
[13:45:58] <logmsgbot>	 !log volans@cumin1001 START - Cookbook sre.hosts.reimage for host sretest1001.eqiad.wmnet with OS bullseye
[13:46:18] <wikibugs>	 (PS1) Klausman: hiera/k8s/ml: Add dummy data for revertrisk user [labs/private] - https://gerrit.wikimedia.org/r/919845
[13:46:24] <wikibugs>	 (PS1) Andrew Bogott: galera: greatly increase max connections [puppet] - https://gerrit.wikimedia.org/r/919846
[13:46:30] <wikibugs>	 (CR) Jelto: [C: +1] "looks good to me. But let's get a second opinion from ServiceOps" [puppet] - https://gerrit.wikimedia.org/r/919350 (https://phabricator.wikimedia.org/T322579) (owner: EoghanGaffney)
[13:47:18] <wikibugs>	 (CR) Jelto: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/919351 (https://phabricator.wikimedia.org/T322579) (owner: EoghanGaffney)
[13:47:25] <wikibugs>	 (PS2) Klausman: hiera/k8s/ml: Add dummy data for revertrisk user [labs/private] - https://gerrit.wikimedia.org/r/919845
[13:48:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:48:56] <wikibugs>	 (CR) Majavah: [C: +1] galera: greatly increase max connections [puppet] - https://gerrit.wikimedia.org/r/919846 (owner: Andrew Bogott)
[13:49:10] <wikibugs>	 (CR) Andrew Bogott: [C: +2] galera: greatly increase max connections [puppet] - https://gerrit.wikimedia.org/r/919846 (owner: Andrew Bogott)
[13:49:19] <wikibugs>	 SRE, DBA, Data-Platform-SRE, Infrastructure-Foundations, and 9 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (ssingh)
[13:50:13] <wikibugs>	 (CR) Jelto: [C: +2] gitlab: make sure letsencrypt extention is disabled [puppet] - https://gerrit.wikimedia.org/r/919022 (https://phabricator.wikimedia.org/T336476) (owner: Jelto)
[13:51:38] <wikibugs>	 (PS1) Ssingh: hiera: temporarily remove dns2002 from authdns_servers [puppet] - https://gerrit.wikimedia.org/r/919847 (https://phabricator.wikimedia.org/T335042)
[13:53:03] <jinxer-wm>	 (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:53:17] <wikibugs>	 SRE, SRE-tools, Infrastructure-Foundations, netops, Patch-For-Review: Setup zero touch provisioning (ZTP) for network devices - https://phabricator.wikimedia.org/T336485 (Volans) I've tested a reimage of a physical host and worked fine, we still have a bit of duplication of requests, do you s...
[13:53:50] <wikibugs>	 SRE, SRE-tools, Infrastructure-Foundations, netops, Patch-For-Review: Setup zero touch provisioning (ZTP) for network devices - https://phabricator.wikimedia.org/T336485 (Volans)
[13:54:14] <wikibugs>	 (CR) Volans: [C: +2] sre.network.provision: add new cookbook [cookbooks] - https://gerrit.wikimedia.org/r/919793 (https://phabricator.wikimedia.org/T336485) (owner: Volans)
[13:56:47] <wikibugs>	 (CR) Elukey: [C: +1] hiera/k8s/ml: Add dummy data for revertrisk user (1 comment) [labs/private] - https://gerrit.wikimedia.org/r/919845 (owner: Klausman)
[13:56:58] <volans>	 !log re-enabled puppet on the install hosts to deploy changes for T336485
[13:57:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:57:02] <stashbot>	 T336485: Setup zero touch provisioning (ZTP) for network devices - https://phabricator.wikimedia.org/T336485
[13:57:12] <wikibugs>	 (Merged) jenkins-bot: sre.network.provision: add new cookbook [cookbooks] - https://gerrit.wikimedia.org/r/919793 (https://phabricator.wikimedia.org/T336485) (owner: Volans)
[13:57:39] <wikibugs>	 SRE, Traffic, serviceops, Platform Team Initiatives (API Gateway): Handle edge cache invalidation for the api gateway - https://phabricator.wikimedia.org/T324200 (daniel) >>! In T324200#8850266, @Ottomata wrote: >> Basically add a configuration that allows to figure out article_url  => derived ur...
[13:57:57] <wikibugs>	 (CR) Elukey: [C: +1] hiera/k8s/ml: add namespace permissions for revertrisk (1 comment) [puppet] - https://gerrit.wikimedia.org/r/919842 (https://phabricator.wikimedia.org/T333124) (owner: Klausman)
[13:57:59] <wikibugs>	 SRE, Data-Engineering, Data-Platform-SRE, LDAP-Access-Requests, and 4 others: Grant temporary access to web based Data Engineering tools to Bishop Fox - https://phabricator.wikimedia.org/T336357 (BTullis) I have now added `uid=twilsonbf` and `uid=ryan-bf` to the `nda` group in LDAP. ` btullis@mwm...
[13:58:17] <wikibugs>	 SRE, Data-Engineering, Data-Platform-SRE, LDAP-Access-Requests, and 4 others: Grant temporary access to web based Data Engineering tools to Bishop Fox - https://phabricator.wikimedia.org/T336357 (BTullis)
[13:58:26] <wikibugs>	 (PS3) Klausman: role::mlserve: Add dummy data for revertrisk user [labs/private] - https://gerrit.wikimedia.org/r/919845
[13:58:40] <wikibugs>	 (CR) Klausman: role::mlserve: Add dummy data for revertrisk user (1 comment) [labs/private] - https://gerrit.wikimedia.org/r/919845 (owner: Klausman)
[13:58:53] <wikibugs>	 (CR) Elukey: k8s/ml/prod: Add revertrisk namespace and permissions, plus TLS config (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/919843 (https://phabricator.wikimedia.org/T333124) (owner: Klausman)
[14:00:06] <logmsgbot>	 !log volans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1001.eqiad.wmnet with reason: host reimage
[14:00:11] <wikibugs>	 (03PS2) 10Klausman: admin_ng/ml-serve: Add revertrisk namespace and permissions, plus TLS config [deployment-charts] - 10https://gerrit.wikimedia.org/r/919843 (https://phabricator.wikimedia.org/T333124)
[14:00:44] <wikibugs>	 (03PS2) 10Klausman: admin_ng/ml-serve: add namespace permissions for revertrisk [puppet] - 10https://gerrit.wikimedia.org/r/919842 (https://phabricator.wikimedia.org/T333124)
[14:00:47] <wikibugs>	 (03CR) 10Klausman: admin_ng/ml-serve: add namespace permissions for revertrisk (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/919842 (https://phabricator.wikimedia.org/T333124) (owner: 10Klausman)
[14:00:53] <wikibugs>	 (03CR) 10Klausman: admin_ng/ml-serve: Add revertrisk namespace and permissions, plus TLS config (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/919843 (https://phabricator.wikimedia.org/T333124) (owner: 10Klausman)
[14:01:12] <wikibugs>	 (03CR) 10Elukey: admin_ng/ml-serve: Add revertrisk namespace and permissions, plus TLS config (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/919843 (https://phabricator.wikimedia.org/T333124) (owner: 10Klausman)
[14:01:57] <wikibugs>	 (03PS3) 10Klausman: admin_ng: add revertrisk namespace and config to ml-serve clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/919843 (https://phabricator.wikimedia.org/T333124)
[14:02:01] <wikibugs>	 (03CR) 10Klausman: admin_ng: add revertrisk namespace and config to ml-serve clusters (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/919843 (https://phabricator.wikimedia.org/T333124) (owner: 10Klausman)
[14:03:16] <logmsgbot>	 !log volans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1001.eqiad.wmnet with reason: host reimage
[14:03:18] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Product-Platform Operations: Q3:rack/setup/install snapshot1016 & snapshot1017 - https://phabricator.wikimedia.org/T334955 (10Jclark-ctr)
[14:03:25] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] admin_ng: add revertrisk namespace and config to ml-serve clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/919843 (https://phabricator.wikimedia.org/T333124) (owner: 10Klausman)
[14:03:28] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Product-Platform Operations: Q3:rack/setup/install snapshot1016 & snapshot1017 - https://phabricator.wikimedia.org/T334955 (10Jclark-ctr) snapshot1016 C6 U26 port 24 cableid 3230 snapshot1017 D6 U39. port 39 Cableid. 23000051
[14:06:23] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA, 10decommission-hardware: decommission db1110.eqiad.wmnet - https://phabricator.wikimedia.org/T335011 (10Jclark-ctr)
[14:06:31] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA, 10decommission-hardware: decommission db1110.eqiad.wmnet - https://phabricator.wikimedia.org/T335011 (10Jclark-ctr) 05Stalled→03Resolved
[14:06:35] <wikibugs>	 (03CR) 10Klausman: [C: 03+2] role::mlserve: Add dummy data for revertrisk user [labs/private] - 10https://gerrit.wikimedia.org/r/919845 (owner: 10Klausman)
[14:06:39] <wikibugs>	 (03CR) 10Klausman: [V: 03+2 C: 03+2] role::mlserve: Add dummy data for revertrisk user [labs/private] - 10https://gerrit.wikimedia.org/r/919845 (owner: 10Klausman)
[14:06:41] <wikibugs>	 (03PS1) 10JMeybohm: Update charts from mesh.configuration 1.2.0 to 1.2.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/919848 (https://phabricator.wikimedia.org/T300324)
[14:06:43] <wikibugs>	 (03PS1) 10JMeybohm: Update charts from mesh.configuration 1.1 to 1.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/919849 (https://phabricator.wikimedia.org/T300324)
[14:07:27] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Update charts from mesh.configuration 1.1 to 1.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/919849 (https://phabricator.wikimedia.org/T300324) (owner: 10JMeybohm)
[14:08:29] <wikibugs>	 10SRE, 10Thumbor, 10Thumbor Migration, 10serviceops, 10Platform Team Workboards (Platform Engineering Reliability): Improve Thumbor's memcached infrastructure - https://phabricator.wikimedia.org/T318695 (10jijiki)
[14:08:32] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Znuny, 10serviceops-collab: upgrade/replace VRTS (formerly ORTS) buster to bullseye - https://phabricator.wikimedia.org/T295416 (10LSobanski) a:03Arnoldokoth
[14:08:40] <wikibugs>	 10SRE, 10Thumbor, 10Thumbor Migration, 10serviceops, 10Platform Team Workboards (Platform Engineering Reliability): Improve Thumbor's memcached infrastructure - https://phabricator.wikimedia.org/T318695 (10jijiki)
[14:08:44] <wikibugs>	 (03CR) 10Klausman: [C: 03+2] admin_ng/ml-serve: add namespace permissions for revertrisk [puppet] - 10https://gerrit.wikimedia.org/r/919842 (https://phabricator.wikimedia.org/T333124) (owner: 10Klausman)
[14:08:48] <wikibugs>	 10SRE, 10Thumbor, 10Thumbor Migration, 10serviceops, and 2 others: Final steps for fully-Kubernetes Thumbor - https://phabricator.wikimedia.org/T334488 (10jijiki)
[14:08:52] <wikibugs>	 10SRE, 10Thumbor, 10Thumbor Migration, 10serviceops, and 2 others: Final steps for fully-Kubernetes Thumbor - https://phabricator.wikimedia.org/T334488 (10jijiki)
[14:09:02] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Znuny, 10serviceops-collab: upgrade/replace VRTS (formerly ORTS) buster to bullseye - https://phabricator.wikimedia.org/T295416 (10LSobanski) p:05Medium→03High
[14:09:26] <wikibugs>	 (03CR) 10Klausman: [C: 03+2] admin_ng: add revertrisk namespace and config to ml-serve clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/919843 (https://phabricator.wikimedia.org/T333124) (owner: 10Klausman)
[14:09:28] <wikibugs>	 (03PS2) 10JMeybohm: Update charts from mesh.configuration 1.1 to 1.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/919849 (https://phabricator.wikimedia.org/T300324)
[14:10:29] <wikibugs>	 (03PS4) 10Herron: profile::arclamp::redis: introduce/move arclamp redis config to profile [puppet] - 10https://gerrit.wikimedia.org/r/919163 (https://phabricator.wikimedia.org/T327277)
[14:11:44] <wikibugs>	 (03Merged) 10jenkins-bot: admin_ng: add revertrisk namespace and config to ml-serve clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/919843 (https://phabricator.wikimedia.org/T333124) (owner: 10Klausman)
[14:13:01] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for bast2003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[14:13:13] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: "LGTM once you've also bumped chart versions 😊" [deployment-charts] - 10https://gerrit.wikimedia.org/r/919849 (https://phabricator.wikimedia.org/T300324) (owner: 10JMeybohm)
[14:16:08] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
[14:17:03] <jinxer-wm>	 (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:17:08] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
[14:19:43] <wikibugs>	 (03CR) 10Herron: [C: 03+2] profile::arclamp::redis: introduce/move arclamp redis config to profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/919163 (https://phabricator.wikimedia.org/T327277) (owner: 10Herron)
[14:20:12] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[14:20:37] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[14:21:04] <logmsgbot>	 !log volans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest1001.eqiad.wmnet with OS bullseye
[14:22:03] <jinxer-wm>	 (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:22:11] <wikibugs>	 (03CR) 10Volans: sre.hosts.reimage: merge reimage cookbooks (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/904510 (owner: 10Slyngshede)
[14:24:35] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 14 days, 0:00:00 on wdqs2021.codfw.wmnet with reason: testing transferpy cookbook
[14:24:37] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on wdqs2021.codfw.wmnet with reason: testing transferpy cookbook
[14:24:48] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: docker::baseimages: support bookworm [puppet] - 10https://gerrit.wikimedia.org/r/919611 (https://phabricator.wikimedia.org/T335560)
[14:25:05] <wikibugs>	 (03PS19) 10Slyngshede: sre.hosts.reimage: merge reimage cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/904510
[14:25:31] <wikibugs>	 (03CR) 10Slyngshede: sre.hosts.reimage: merge reimage cookbooks (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/904510 (owner: 10Slyngshede)
[14:26:53] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM!" [cookbooks] - 10https://gerrit.wikimedia.org/r/904510 (owner: 10Slyngshede)
[14:34:05] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] docker::baseimages: support bookworm [puppet] - 10https://gerrit.wikimedia.org/r/919611 (https://phabricator.wikimedia.org/T335560) (owner: 10Giuseppe Lavagetto)
[14:43:25] <wikibugs>	 (03PS20) 10Krinkle: Profiler: Implement "Excimer UI" option for WikimediaDebug [mediawiki-config] - 10https://gerrit.wikimedia.org/r/902529 (https://phabricator.wikimedia.org/T291015)
[14:44:17] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Profiler: Implement "Excimer UI" option for WikimediaDebug [mediawiki-config] - 10https://gerrit.wikimedia.org/r/902529 (https://phabricator.wikimedia.org/T291015) (owner: 10Krinkle)
[14:45:32] <wikibugs>	 10SRE, 10Traffic, 10serviceops, 10Platform Team Initiatives (API Gateway): Handle edge cache invalidation for the api gateway - https://phabricator.wikimedia.org/T324200 (10Joe) >>! In T324200#8850266, @Ottomata wrote: >> Basically add a configuration that allows to figure out article_url  => derived urls...
[14:46:22] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] docker::baseimages: support bookworm [puppet] - 10https://gerrit.wikimedia.org/r/919611 (https://phabricator.wikimedia.org/T335560) (owner: 10Giuseppe Lavagetto)
[14:47:26] <wikibugs>	 10SRE, 10Thumbor, 10Thumbor Migration, 10serviceops, 10Platform Team Workboards (Platform Engineering Reliability): Improve Thumbor's memcached backend - https://phabricator.wikimedia.org/T318695 (10jijiki)
[14:48:06] <wikibugs>	 10SRE, 10Thumbor, 10Thumbor Migration, 10serviceops, 10Platform Team Workboards (Platform Engineering Reliability): Improve Thumbor's memcached backend - https://phabricator.wikimedia.org/T318695 (10jijiki)
[14:48:43] <wikibugs>	 (03CR) 10AOkoth: [C: 03+2] site: revert vrts2001 role post re-image [puppet] - 10https://gerrit.wikimedia.org/r/918486 (https://phabricator.wikimedia.org/T323515) (owner: 10AOkoth)
[15:00:07] <logmsgbot>	 !log aokoth@cumin1001 START - Cookbook sre.hosts.downtime for 14 days, 0:00:00 on vrts2001.codfw.wmnet with reason: Setup Incomplete
[15:00:20] <logmsgbot>	 !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on vrts2001.codfw.wmnet with reason: Setup Incomplete
[15:00:36] <icinga-wm>	 RECOVERY - Check systemd state on db2184 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:02:22] <wikibugs>	 (03PS2) 10JMeybohm: Update charts from mesh.configuration 1.2.0 to 1.2.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/919848 (https://phabricator.wikimedia.org/T300324)
[15:02:24] <wikibugs>	 (03PS3) 10JMeybohm: Update charts from mesh.configuration 1.1 to 1.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/919849 (https://phabricator.wikimedia.org/T300324)
[15:02:59] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10netops: Setup zero touch provisioning (ZTP) for network devices - https://phabricator.wikimedia.org/T336485 (10ayounsi) @Volans is it possible to have a full pcap of those `unknown network segment` ?
[15:04:52] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: vrts1001: Switch to insetup::serviceops_collab [puppet] - 10https://gerrit.wikimedia.org/r/919856
[15:04:55] <wikibugs>	 (03CR) 10Jaime Nuche: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/919834 (https://phabricator.wikimedia.org/T336168) (owner: 10Jaime Nuche)
[15:05:16] <wikibugs>	 (03CR) 10Jaime Nuche: [C: 04-1] "New secret needs to be added before this patch can be merged." [puppet] - 10https://gerrit.wikimedia.org/r/919834 (https://phabricator.wikimedia.org/T336168) (owner: 10Jaime Nuche)
[15:30:05] <jouncebot>	 jan_drewniak: I, the Bot under the Fountain, call upon thee, The Deployer, to do Wikimedia Portals Update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230515T1530).
[15:30:39] * Krinkle is staging on mwdebug1002
[15:39:16] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:40:38] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.278 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:51:26] <vgutierrez>	 hmmm wikibugs quit 40 minutes ago
[15:53:18] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH nodes) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:53:50] <dancy>	 vgutierrez: I get the impression that something is wrong with toolforge, because https://versions.toolforge.org/ isn't working either
[15:54:47] <vgutierrez>	 hmm kinda expected apparently after checking #wikimedia-cloud topic: "Status: system instability due to ongoing maintenance"
[15:55:30] <dancy>	 aha!
[15:55:35] <dancy>	 Thanks for finding that. :-)
[15:57:45] * Krinkle done testing on mwdebug1002
[15:58:18] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH nodes) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[16:30:10] <icinga-wm>	 PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:35:01] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for testvm2005:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[17:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230515T1700)
[17:00:05] <jouncebot>	 ryankemper: My dear minions, it's time we take the moon! Just kidding. Time for Wikidata Query Service weekly deploy deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230515T1700).
[17:15:11] <logmsgbot>	 !log volans@cumin2002 START - Cookbook sre.network.provision for device ssw1-a1-codfw.mgmt.codfw.wmnet
[17:15:12] <logmsgbot>	 !log volans@cumin2002 START - Cookbook sre.dns.netbox
[17:26:15] <logmsgbot>	 !log volans@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for ssw1-a1-codfw - volans@cumin2002"
[17:27:20] <logmsgbot>	 !log volans@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for ssw1-a1-codfw - volans@cumin2002"
[17:27:20] <logmsgbot>	 !log volans@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:27:21] <logmsgbot>	 !log volans@cumin2002 START - Cookbook sre.dns.netbox
[17:29:21] <logmsgbot>	 !log volans@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove management record for ssw1-a1-codfw - volans@cumin2002"
[17:30:23] <logmsgbot>	 !log volans@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove management record for ssw1-a1-codfw - volans@cumin2002"
[17:30:23] <logmsgbot>	 !log volans@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:30:24] <logmsgbot>	 !log volans@cumin2002 END (FAIL) - Cookbook sre.network.provision (exit_code=99) for device ssw1-a1-codfw.mgmt.codfw.wmnet
[17:39:11] <logmsgbot>	 !log volans@cumin2002 START - Cookbook sre.network.provision for device ssw1-a1-codfw.mgmt.codfw.wmnet
[17:39:13] <logmsgbot>	 !log volans@cumin2002 START - Cookbook sre.dns.netbox
[17:41:14] <logmsgbot>	 !log volans@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for ssw1-a1-codfw - volans@cumin2002"
[17:41:26] <mutante>	 I don't see my gerrit comments on this channel anymore.
[17:41:55] <volans>	 mutante: see above, bots missing because of cloud issues
[17:42:07] <volans>	 I don't know the current situation
[17:42:15] <logmsgbot>	 !log volans@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for ssw1-a1-codfw - volans@cumin2002"
[17:42:15] <logmsgbot>	 !log volans@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:42:16] <mutante>	 volans: oh, I missed. thank you! got it
[17:42:17] <logmsgbot>	 !log volans@cumin2002 START - Cookbook sre.dns.netbox
[17:45:46] <mutante>	 volans: PS on https://phabricator.wikimedia.org/T335586#8816922
[17:46:13] <volans>	 thx
[17:46:24] <logmsgbot>	 !log volans@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove management record for ssw1-a1-codfw - volans@cumin2002"
[17:47:29] <logmsgbot>	 !log volans@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove management record for ssw1-a1-codfw - volans@cumin2002"
[17:47:30] <logmsgbot>	 !log volans@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:47:30] <logmsgbot>	 !log volans@cumin2002 END (FAIL) - Cookbook sre.network.provision (exit_code=99) for device ssw1-a1-codfw.mgmt.codfw.wmnet
[18:06:18] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host dns2005.wikimedia.org with OS bullseye
[18:06:44] <logmsgbot>	 !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dns2005.wikimedia.org with OS bullseye
[18:10:44] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host dns2005.wikimedia.org with OS bullseye
[18:10:53] <logmsgbot>	 !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dns2005.wikimedia.org with OS bullseye
[18:11:23] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host dns2005.wikimedia.org with OS bullseye
[18:11:33] <logmsgbot>	 !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dns2005.wikimedia.org with OS bullseye
[18:13:01] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for bast2003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[18:31:58] <wikibugs>	 10SRE, 10Infrastructure-Foundations: DHCP error while trying to run the reimaging cookbook for dns2005.wikimedia.org (install server install2004.wikimedia.org) - https://phabricator.wikimedia.org/T336696 (10ssingh) My very (basic) attempts at debugging this: `/etc/dhcp/automation/mgmt-codfw/ssw1-a1-codfw.mgmt....
[18:33:00] <brett>	 !log Rolling out maglev LVS scheduler in eqsin - T263797
[18:33:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:33:05] <stashbot>	 T263797: Switch Source Hashing ('sh') scheduling on LVS hosts to Maglev hashing ('mh') - https://phabricator.wikimedia.org/T263797
[18:38:49] <wikibugs>	 10SRE, 10Infrastructure-Foundations: DHCP error while trying to run the reimaging cookbook for dns2005.wikimedia.org (install server install2004.wikimedia.org) - https://phabricator.wikimedia.org/T336696 (10Dzahn) @ssingh How about this fix to unblock you:  - ssh install2004.wikimedia.org - edit /etc/dhcp/dhcp...
[18:44:35] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Adee Ritman (WMDE) - https://phabricator.wikimedia.org/T336434 (10Dzahn) has been signed - ready to go - needs patch
[18:48:38] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Adee Ritman (WMDE) - https://phabricator.wikimedia.org/T336434 (10Dzahn) a:05adee_wmde→03CDanis Hi Chris, if you wanna take over clinic duty, this just needs a patch like https://gerrit.wikimedia.org/r/c/operations/puppet/+/919225.  NDA was signe...
[18:50:01] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Adee Ritman (WMDE) - https://phabricator.wikimedia.org/T336434 (10Dzahn) LDAP uid: adri  I already added them to "wmf" and "nda" like their WMDE co-workers.
[18:52:33] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Adee Ritman (WMDE), Robert Timm (WMDE) and Loren Johnson (WMDE) - https://phabricator.wikimedia.org/T335941 (10Dzahn) 05In progress→03Resolved @KFrancis Thank you as always. I have added Adee to the LDAP groups. There will be some follow-up to do...
[18:54:04] <icinga-wm>	 PROBLEM - pybal on lvs5005 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal
[18:54:35] <sukhe>	 !log set routing-options static route 208.80.153.231/32 next-hop [ 208.80.153.48 208.80.153.10 ]: codfw row D maint 2023/05/16 [dns2002] T335042
[18:54:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:54:40] <stashbot>	 T335042: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042
[18:54:42] <mutante>	 !log LDAP - added uid 'adee' to groups wmde and nda - T336434
[18:54:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:54:46] <stashbot>	 T336434: Grant Access to ldap/wmde for Adee Ritman (WMDE) - https://phabricator.wikimedia.org/T336434
[18:54:48] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs5005 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal
[18:55:04] <icinga-wm>	 PROBLEM - PyBal connections to etcd on lvs5005 is CRITICAL: CRITICAL: 0 connections established with conf2006.codfw.wmnet:4001 (min=4) https://wikitech.wikimedia.org/wiki/PyBal
[18:55:04] <sukhe>	 ^ expected, brett is upgrading
[18:55:09] <mutante>	 ack, thanks
[18:56:20] <wikibugs>	 (03CR) 10Dzahn: "new PS compiles now as https://puppet-compiler.wmflabs.org/output/919834/41191/doc1003.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/919834 (https://phabricator.wikimedia.org/T336168) (owner: 10Jaime Nuche)
[19:01:34] <wikibugs>	 (03CR) 10Dzahn: "I wish we wouldn't have to touch all the servers, including the prod ones, to test on new non-prod servers. .." [puppet] - 10https://gerrit.wikimedia.org/r/919834 (https://phabricator.wikimedia.org/T336168) (owner: 10Jaime Nuche)
[19:02:35] <logmsgbot>	 !log bking@deploy1002 Started deploy [wdqs/wdqs@41174d5]: 0.3.124
[19:03:35] <inflatador>	 !log [WDQS Deploy] Tests passing following deploy of `0.3.124` on canary `wdqs1003`; proceeding to rest of fleet
[19:03:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:05:16] <wikibugs>	 (03CR) 10Dzahn: "will deploy it regardless.. but want lunch first and watch it closely" [puppet] - 10https://gerrit.wikimedia.org/r/919834 (https://phabricator.wikimedia.org/T336168) (owner: 10Jaime Nuche)
[19:12:40] <logmsgbot>	 !log bking@deploy1002 Finished deploy [wdqs/wdqs@41174d5]: 0.3.124 (duration: 10m 05s)
[19:12:58] <logmsgbot>	 !log bking@deploy1002 Started deploy [wdqs/wdqs@41174d5]: (no justification provided)
[19:15:29] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc-gp2001.codfw.wmnet
[19:15:51] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc-gp1001.eqiad.wmnet
[19:18:44] <logmsgbot>	 !log bking@deploy1002 Finished deploy [wdqs/wdqs@41174d5]: (no justification provided) (duration: 05m 46s)
[19:18:47] <logmsgbot>	 !log bking@deploy1002 Started deploy [wdqs/wdqs@41174d5]: (no justification provided)
[19:18:52] <logmsgbot>	 !log bking@deploy1002 Finished deploy [wdqs/wdqs@41174d5]: (no justification provided) (duration: 00m 05s)
[19:19:05] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10LDAP: Migrate the r/w LDAP servers to Bullseye - https://phabricator.wikimedia.org/T331699 (10taavi) > The easiest way I can think of to block all of those actions would be to temporarily change the uid=novaadmin user's password. We don't have anything else that would b...
[19:19:48] <logmsgbot>	 !log bking@deploy1002 Started deploy [wdqs/wdqs@41174d5]: (no justification provided)
[19:19:53] <logmsgbot>	 !log bking@deploy1002 Finished deploy [wdqs/wdqs@41174d5]: (no justification provided) (duration: 00m 05s)
[19:21:30] <wikibugs>	 (03PS6) 10Eevans: (WIP) cassandra: add support for version 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/913265 (https://phabricator.wikimedia.org/T313814)
[19:22:03] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] (WIP) cassandra: add support for version 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/913265 (https://phabricator.wikimedia.org/T313814) (owner: 10Eevans)
[19:22:53] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-gp2001.codfw.wmnet
[19:23:00] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-gp1001.eqiad.wmnet
[19:24:56] <icinga-wm>	 RECOVERY - pybal on lvs5005 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal
[19:25:46] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs5005 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[19:26:14] <jinxer-wm>	 (SystemdUnitFailed) firing: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-categories.service Failed on wdqs1010:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:26:21] <logmsgbot>	 !log bking@deploy1002 Started deploy [wdqs/wdqs@41174d5] (wcqs): deploy 0.3.124 to WCQS
[19:27:04] <icinga-wm>	 PROBLEM - Check systemd state on wdqs1010 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-categories.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:28:18] <icinga-wm>	 RECOVERY - PyBal connections to etcd on lvs5005 is OK: OK: 4 connections established with conf2006.codfw.wmnet:4001 (min=4) https://wikitech.wikimedia.org/wiki/PyBal
[19:28:25] <logmsgbot>	 !log bking@deploy1002 Finished deploy [wdqs/wdqs@41174d5] (wcqs): deploy 0.3.124 to WCQS (duration: 02m 03s)
[19:28:33] <wikibugs>	 (03PS7) 10Eevans: (WIP) cassandra: add support for version 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/913265 (https://phabricator.wikimedia.org/T313814)
[19:29:03] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] (WIP) cassandra: add support for version 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/913265 (https://phabricator.wikimedia.org/T313814) (owner: 10Eevans)
[19:29:27] <wikibugs>	 (03PS8) 10Eevans: (WIP) cassandra: add support for version 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/913265 (https://phabricator.wikimedia.org/T313814)
[19:29:57] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] (WIP) cassandra: add support for version 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/913265 (https://phabricator.wikimedia.org/T313814) (owner: 10Eevans)
[19:32:09] <wikibugs>	 (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/913265 (https://phabricator.wikimedia.org/T313814) (owner: 10Eevans)
[19:32:56] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc-gp2002.codfw.wmnet
[19:33:03] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc-gp1002.eqiad.wmnet
[19:37:14] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-gp1002.eqiad.wmnet
[19:40:00] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-gp2002.codfw.wmnet
[19:43:31] <cdanis>	 mutante: thanks for clinic duty, I'm taking over
[19:43:57] <wikibugs>	 SRE, SRE-tools, Infrastructure-Foundations, Spicerack, Discovery-Search (Current work): Create cookbook to reindex into elasticsearch / cirrus - https://phabricator.wikimedia.org/T219507 (bking) Moving this out of the current work, but this is still a priority for us. Will revisit next quarter.
[19:44:05] <wikibugs>	 SRE, SRE-tools, Discovery-Search, Infrastructure-Foundations, Spicerack: Create cookbook to reindex into elasticsearch / cirrus - https://phabricator.wikimedia.org/T219507 (bking)
[19:47:01] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 2:00:00 on 20 hosts with reason: T335042 maintenance
[19:47:06] <stashbot>	 T335042: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042
[19:47:16] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 2:00:00 on 20 hosts with reason: T335042 maintenance
[19:47:16] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc-gp1003.eqiad.wmnet
[19:49:13] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.elasticsearch.ban Banning hosts: elastic[2050-2054,2060,2067-2068,2072,2084-2086] for row D switch upgrade - bking@cumin1001 - T335042
[19:49:13] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.ban (exit_code=99) Banning hosts: elastic[2050-2054,2060,2067-2068,2072,2084-2086] for row D switch upgrade - bking@cumin1001 - T335042
[19:49:38] <icinga-wm>	 PROBLEM - pybal on lvs5006 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal
[19:50:03] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc-gp2003.codfw.wmnet
[19:50:21] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.elasticsearch.ban Banning hosts: elastic[2050-2054,2060,2067-2068,2072,2084-2086]* for row D switch upgrade - bking@cumin1001 - T335042
[19:50:25] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: elastic[2050-2054,2060,2067-2068,2072,2084-2086]* for row D switch upgrade - bking@cumin1001 - T335042
[19:50:34] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs5006 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal
[19:53:08] <icinga-wm>	 PROBLEM - PyBal connections to etcd on lvs5006 is CRITICAL: CRITICAL: 0 connections established with conf2006.codfw.wmnet:4001 (min=16) https://wikitech.wikimedia.org/wiki/PyBal
[19:54:19] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-gp1003.eqiad.wmnet
[19:55:37] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-gp2003.codfw.wmnet
[19:55:41] <wikibugs>	 (PS1) Ottomata: dse-k8s-services/mediawiki-page-content-change-enrichment - fix private helmfile values [deployment-charts] - https://gerrit.wikimedia.org/r/919885 (https://phabricator.wikimedia.org/T336656)
[19:55:52] <icinga-wm>	 RECOVERY - pybal on lvs5006 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal
[19:56:19] <logmsgbot>	 !log otto@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[19:56:23] <logmsgbot>	 !log otto@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[19:56:48] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs5006 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[19:58:40] <icinga-wm>	 RECOVERY - PyBal connections to etcd on lvs5006 is OK: OK: 16 connections established with conf2006.codfw.wmnet:4001 (min=16) https://wikitech.wikimedia.org/wiki/PyBal
[19:59:19] <wikibugs>	 (PS1) Ottomata: dse-k8s-services/mediawiki-page-content-change-enrichment - use kafka SSL [deployment-charts] - https://gerrit.wikimedia.org/r/919887 (https://phabricator.wikimedia.org/T331526)
[19:59:34] <wikibugs>	 (CR) Ottomata: [V: +2 C: +2] dse-k8s-services/mediawiki-page-content-change-enrichment - use kafka SSL [deployment-charts] - https://gerrit.wikimedia.org/r/919887 (https://phabricator.wikimedia.org/T331526) (owner: Ottomata)
[20:00:05] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230515T2000).
[20:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[20:00:24] <logmsgbot>	 !log otto@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[20:00:28] <logmsgbot>	 !log otto@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[20:02:05] <wikibugs>	 (CR) Ottomata: [C: +2] dse-k8s-services/mediawiki-page-content-change-enrichment - fix private helmfile values [deployment-charts] - https://gerrit.wikimedia.org/r/919885 (https://phabricator.wikimedia.org/T336656) (owner: Ottomata)
[20:20:01] <jinxer-wm>	 (NodeTextfileStale) firing: (2) Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[20:20:02] <wikibugs>	 SRE: Bridge wikimediastatus.net to Mastodon - https://phabricator.wikimedia.org/T336701 (Legoktm)
[20:20:37] <wikibugs>	 (PS1) Andrew Bogott: wmcs-backup: exclude non-ceph VMs from backup [puppet] - https://gerrit.wikimedia.org/r/919895
[20:20:39] <wikibugs>	 (PS1) Andrew Bogott: remove wmcs-backup-instances script, no longer used [puppet] - https://gerrit.wikimedia.org/r/919896
[20:25:01] <jinxer-wm>	 (NodeTextfileStale) firing: (3) Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[20:29:14] <mutante>	 cdanis: :) thanks! was trying to strike the balance there and not hand over the confusing ticket for 3 users
[20:29:45] <wikibugs>	 (CR) Andrew Bogott: [C: +2] wmcs-backup: exclude non-ceph VMs from backup [puppet] - https://gerrit.wikimedia.org/r/919895 (owner: Andrew Bogott)
[20:43:09] <wikibugs>	 (PS1) CDanis: admin: add adri as ldap-only [puppet] - https://gerrit.wikimedia.org/r/919903 (https://phabricator.wikimedia.org/T336434)
[20:44:35] <wikibugs>	 (CR) CDanis: [C: +2] admin: add adri as ldap-only [puppet] - https://gerrit.wikimedia.org/r/919903 (https://phabricator.wikimedia.org/T336434) (owner: CDanis)
[20:47:05] <wikibugs>	 (CR) Dzahn: [C: +1] admin: add adri as ldap-only [puppet] - https://gerrit.wikimedia.org/r/919903 (https://phabricator.wikimedia.org/T336434) (owner: CDanis)
[20:49:01] <wikibugs>	 SRE, LDAP-Access-Requests, Patch-For-Review: Grant Access to ldap/wmde for Adee Ritman (WMDE) - https://phabricator.wikimedia.org/T336434 (CDanis) In progress→Resolved
[20:53:36] <wikibugs>	 SRE, Incident Tooling: Bridge wikimediastatus.net to Mastodon - https://phabricator.wikimedia.org/T336701 (Peachey88)
[21:00:06] <jouncebot>	 Reedy, sbassett, Maryum, and manfredi: OwO what's this, a deployment window?? Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230515T2100). nyaa~
[21:05:14] <wikibugs>	 SRE, Incident Tooling: Bridge wikimediastatus.net to Mastodon - https://phabricator.wikimedia.org/T336701 (Dzahn) side question: so we already have an RSS feed of the outage notifications? If so, can you paste the URL.
[21:12:13] <sbassett>	 Hey all - mstyles and I have a couple of sec patches to deploy for: T335612 and T323651
[21:13:30] <mutante>	 sukhe: no LVS work?
[21:17:07] <mutante>	 sbassett: well, it's your window on the calendar and seems quiet
[21:19:05] <mutante>	 I see nothing blocking deployment in SAL either, go for it
[21:19:23] <sukhe>	 no lvs work this week, thanks for checking 
[21:19:28] <sukhe>	 all clear from our end
[21:22:29] <sukhe>	 (we block the deployment window and also put a scap lock just to be extra sure)
[21:24:24] <mutante>	 nice!
[21:25:01] <wikibugs>	 SRE, Incident Tooling: Bridge wikimediastatus.net to Mastodon - https://phabricator.wikimedia.org/T336701 (Peachey88) It looks like they should be https://www.wikimediastatus.net/history.atom (atom) and https://www.wikimediastatus.net/history.rss (rss) and should now be enabled after {T305174}
[21:31:59] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Product-Platform Operations: Q3:rack/setup/install snapshot1016 & snapshot1017 - https://phabricator.wikimedia.org/T334955 (Jclark-ctr)
[21:38:04] <wikibugs>	 (CR) Dzahn: [C: +2] doc: temporary config for docs publishing from gitlab [puppet] - https://gerrit.wikimedia.org/r/919834 (https://phabricator.wikimedia.org/T336168) (owner: Jaime Nuche)
[21:40:27] <wikibugs>	 (CR) Dzahn: [C: +2] "deploying one host at a time on all 4 hosts and checking what it does exactly." [puppet] - https://gerrit.wikimedia.org/r/919834 (https://phabricator.wikimedia.org/T336168) (owner: Jaime Nuche)
[21:44:35] <wikibugs>	 (CR) Dzahn: [C: +2] "doc2001 - noop - as designed because not active host" [puppet] - https://gerrit.wikimedia.org/r/919834 (https://phabricator.wikimedia.org/T336168) (owner: Jaime Nuche)
[21:48:58] <wikibugs>	 (CR) Dzahn: [C: +2] doc: temporary config for docs publishing from gitlab (1 comment) [puppet] - https://gerrit.wikimedia.org/r/919834 (https://phabricator.wikimedia.org/T336168) (owner: Jaime Nuche)
[21:50:50] <wikibugs>	 (CR) Dzahn: [C: +2] "@Jaime - this is in place on doc1003, I see the new firewall rules that allow gitlab IPs to connect to tcp 873 and 1873, I see the new rsyn" [puppet] - https://gerrit.wikimedia.org/r/919834 (https://phabricator.wikimedia.org/T336168) (owner: Jaime Nuche)
[21:51:17] <maryum>	 patch deployed for T335612
[21:51:43] <maryum>	 !log Deployed patch for T335612
[21:51:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:00:12] <wikibugs>	 SRE, Incident Tooling: Bridge wikimediastatus.net to Mastodon - https://phabricator.wikimedia.org/T336701 (Dzahn) Thank you! both work for me and appear to be valid feeds :) Should they be in planet, btw?
[22:02:27] <maryum>	 !log deployed patch for T323651
[22:02:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:04:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:04:29] <wikibugs>	 SRE, Infrastructure-Foundations: Advertised RSS/Atom feeds for wikimediastatus.net don't work - https://phabricator.wikimedia.org/T305174 (Dzahn) Should I include the feed in https://en.planet.wikimedia.org/ or is it too much for that audience?
[22:09:03] <jinxer-wm>	 (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:13:01] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for bast2003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[22:13:34] <wikibugs>	 (PS1) BryanDavis: perl532: Add libmime-lite-perl [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/919920 (https://phabricator.wikimedia.org/T320904)
[22:13:44] <mutante>	  /me searches codesearch for the code of codesearch - https://codesearch.wmcloud.org/search/?q=codesearch&files=&excludeFiles=&repos=
[22:14:23] <mutante>	 but where is the config of codesearch in there? heh
[22:18:22] <mutante>	 I have found modules/codesearch/files/hound-config but that is not it. something tells it to use "gerrit-replica.wikimedia.org" for search but I can't find where that is configured
[22:19:37] <mutante>	 would like to make a patch that replaces "gerrit-replica" with "gerrit"
[22:20:11] <wikibugs>	 (PS1) BryanDavis: perl532: Add libphp-serialization-perl [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/919922 (https://phabricator.wikimedia.org/T323522)
[22:20:15] <mutante>	 ah, could be that it's a cloud hiera edit
[22:21:45] <dancy>	 https://gerrit.wikimedia.org/r/plugins/gitiles/labs/codesearch/+/refs/heads/master/write_config.py#145 ?
[22:23:18] <mutante>	 dancy: oh, "labs/" ! :) thank you
[22:24:01] <mutante>	 labs will be around for a while, moving all those to /cloud/ seems a hassle
[22:24:17] <mutante>	 was it in the results of codesearch?
[22:26:56] <icinga-wm>	 RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:31:39] <mutante>	 created https://gerrit.wikimedia.org/r/c/labs/codesearch/+/919925 but labs repos don't show on -operations
[22:39:27] <dancy>	 mutante: I happened to have a local checkout of that repo so I checked the git remote.
[22:40:54] <mutante>	 dancy: gotcha! thanks:)
[22:43:04] <wikibugs>	 (PS5) BryanDavis: Make "make" available in all images [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/864828 (https://phabricator.wikimedia.org/T320343) (owner: Jberkel)
[22:44:47] <wikibugs>	 (CR) BryanDavis: [C: +2] Make "make" available in all images [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/864828 (https://phabricator.wikimedia.org/T320343) (owner: Jberkel)
[22:45:19] <wikibugs>	 (Merged) jenkins-bot: Make "make" available in all images [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/864828 (https://phabricator.wikimedia.org/T320343) (owner: Jberkel)
[22:47:00] <wikibugs>	 (PS2) BryanDavis: perl532: Add libmime-lite-perl [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/919920 (https://phabricator.wikimedia.org/T320904)
[22:47:06] <wikibugs>	 (PS2) BryanDavis: perl532: Add libphp-serialization-perl [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/919922 (https://phabricator.wikimedia.org/T323522)
[22:56:04] <icinga-wm>	 PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS6939/IPv4: Idle - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[23:02:28] <icinga-wm>	 PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 159 probes of 698 (alerts on 90) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[23:04:45] <mutante>	 cdanis: so, I expected to be removed from topic 4 minutes ago. it's always making me a bit uncomfortable to be on that when my shift actually ended
[23:04:58] <mutante>	 I suspect a relation to earlier cloud / bot outage
[23:06:12] <mutante>	 regarding the actual on-call shift.. nothing happened. 
[23:06:49] <mutante>	 is it still good form to hand-over to the next shift? which channel is best
[23:09:14] <icinga-wm>	 RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 86 probes of 698 (alerts on 90) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[23:09:29] <mutante>	 well, an active ping just to say "nothing" doesn't seem that great either
[23:09:56] <mutante>	 if anyone could update the topic / restart the bot though, would appreciate that. going afk, shift just ended if you look in VO
[23:26:14] <jinxer-wm>	 (SystemdUnitFailed) firing: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-categories.service Failed on wdqs1010:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed