[00:35:00] (NodeTextfileStale) firing: Stale textfile for testvm2005:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[00:39:16] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/919377
[00:39:20] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/919377 (owner: TrainBranchBot)
[00:56:34] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/919377 (owner: TrainBranchBot)
[00:59:06] (CirrusSearchJobQueueBacklogTooBig) firing: CirrusSearch job topic eqiad.mediawiki.job.cirrusSearchLinksUpdate is heavily backlogged with 241.2k messages - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=eqiad%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueBacklogTooBig
[01:19:06] (CirrusSearchJobQueueBacklogTooBig) resolved: CirrusSearch job topic eqiad.mediawiki.job.cirrusSearchLinksUpdate is heavily backlogged with 200.2k messages - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=eqiad%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueBacklogTooBig
[01:22:20] (CR) BryanDavis: signup:blocklist Expand blocklist feature (1 comment) [software/bitu] - https://gerrit.wikimedia.org/r/919005 (owner: Slyngshede)
[01:39:04] Puppet, Cloud-VPS, Infrastructure-Foundations, cloud-services-team: git-sync-upstream failing - https://phabricator.wikimedia.org/T336263 (Andrew) Open→Resolved
[01:56:25] PROBLEM - Check systemd state on stat1008 is CRITICAL: CRITICAL - degraded: The following units failed: jupyter-appledora-singleuser.service,jupyter-dsaez-singleuser-conda-analytics.service,session-c8682.scope,user@114.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:02:39] PROBLEM - Check systemd state on stat1008 is CRITICAL: CRITICAL - degraded: The following units failed: jupyter-appledora-singleuser.service,jupyter-dsaez-singleuser-conda-analytics.service,session-c8682.scope,session-c8683.scope,user-runtime-dir@114.service,user@114.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:04:57] (PS2) KartikMistry: Enable the new Special:Contribute page entry point for desktop on selected wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/918922 (https://phabricator.wikimedia.org/T327868)
[02:06:32] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:07:17] PROBLEM - Check systemd state on stat1008 is CRITICAL: CRITICAL - degraded: The following units failed: jupyter-appledora-singleuser.service,jupyter-dsaez-singleuser-conda-analytics.service,session-c8682.scope,session-c8683.scope,user-runtime-dir@114.service,user@114.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:11:32] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:11:53] PROBLEM - Check systemd state on stat1008 is CRITICAL: CRITICAL - degraded: The following units failed: jupyter-appledora-singleuser.service,jupyter-dsaez-singleuser-conda-analytics.service,session-c8682.scope,session-c8683.scope,user-runtime-dir@114.service,user@114.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:13:01] (NodeTextfileStale) firing: Stale textfile for bast2003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[02:16:31] PROBLEM - Check systemd state on stat1008 is CRITICAL: CRITICAL - degraded: The following units failed: jupyter-appledora-singleuser.service,jupyter-dsaez-singleuser-conda-analytics.service,session-c8682.scope,session-c8683.scope,user-runtime-dir@114.service,user@114.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:26:32] (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:28:34] (Access port speed <= 100Mbps) firing: Alert for device asw2-b-eqiad.mgmt.eqiad.wmnet - Access port speed <= 100Mbps - https://alerts.wikimedia.org/?q=alertname%3DAccess+port+speed+%3C%3D+100Mbps
[02:32:03] (PS1) Andrew Bogott: mwopenstackclients: replace use of os_client_config with openstack.config [puppet] - https://gerrit.wikimedia.org/r/919471 (https://phabricator.wikimedia.org/T336104)
[03:39:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:44:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:35:00] (NodeTextfileStale) firing: Stale textfile for testvm2005:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[04:49:53] PROBLEM - SSH on bast4004 is CRITICAL: Server answer: Exceeded MaxStartups https://wikitech.wikimedia.org/wiki/SSH/monitoring
[04:51:29] RECOVERY - SSH on bast4004 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[05:36:23] <_joe_> !log building bookworm image for the first time T335560
[05:36:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:36:28] T335560: Publish Wikimedia bookworm base Docker image - https://phabricator.wikimedia.org/T335560
[05:42:45] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-api-ext_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:50:43] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[05:53:37] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:56:33] RECOVERY - snapshot of s3 in codfw on backupmon1001 is OK: Last snapshot for s3 at codfw (db2139) taken on 2023-05-15 02:31:57 (1289 GiB, +0.1 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[06:01:15] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[06:13:01] (NodeTextfileStale) firing: Stale textfile for bast2003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[06:14:31] PROBLEM - SSH on bast4004 is CRITICAL: Server answer: Exceeded MaxStartups https://wikitech.wikimedia.org/wiki/SSH/monitoring
[06:16:05] RECOVERY - SSH on bast4004 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[06:17:38] SRE, DC-Ops, Infrastructure-Foundations, netops: Access port speed <= 100Mbps False positives - https://phabricator.wikimedia.org/T336511 (ayounsi)
[06:18:24] SRE, DC-Ops, Infrastructure-Foundations, netops: Access port speed <= 100Mbps False positives - https://phabricator.wikimedia.org/T336511 (ayounsi) I muted the alert for now until we can get to the bottom of it as it was spamming too much.
[06:19:48] (PS1) Giuseppe Lavagetto: docker::baseimages: support bookworm [puppet] - https://gerrit.wikimedia.org/r/919611 (https://phabricator.wikimedia.org/T335560)
[06:28:35] (Access port speed <= 100Mbps) firing: Alert for device asw2-b-eqiad.mgmt.eqiad.wmnet - Access port speed <= 100Mbps - https://alerts.wikimedia.org/?q=alertname%3DAccess+port+speed+%3C%3D+100Mbps
[06:33:34] (Access port speed <= 100Mbps) resolved: Alert for device asw2-b-eqiad.mgmt.eqiad.wmnet - Access port speed <= 100Mbps - https://alerts.wikimedia.org/?q=alertname%3DAccess+port+speed+%3C%3D+100Mbps
[06:38:46] (CR) Giuseppe Lavagetto: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41177/console" [puppet] - https://gerrit.wikimedia.org/r/919611 (https://phabricator.wikimedia.org/T335560) (owner: Giuseppe Lavagetto)
[07:00:05] Amir1, Urbanecm, and taavi: Time to snap out of that daydream and deploy UTC morning backport window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230515T0700).
[07:00:05] No Gerrit patches in the queue for this window AFAICS.
[07:05:12] SRE, ops-codfw, cloud-services-team (FY2022/2023-Q4): cloudcontrol2001-dev: make it a cloudlb backend - https://phabricator.wikimedia.org/T336236 (jcrespo) hi, cloudcontrol2001-dev is failing to do all its backups. Usually this is due to maintenance or a defect on setup: ` root@backup1001:~$ check_...
[07:05:15] SRE, ops-codfw, cloud-services-team (FY2022/2023-Q4): cloudcontrol2001-dev: make it a cloudlb backend - https://phabricator.wikimedia.org/T336236 (jcrespo) I can file this on a separate ticket, if needed.
[07:08:44] (PS4) Elukey: service::catalog: set lvs_setup for k8s-ingress-ml-staging [puppet] - https://gerrit.wikimedia.org/r/918409 (https://phabricator.wikimedia.org/T335756)
[07:08:58] SRE, ops-codfw, cloud-services-team (FY2022/2023-Q4): cloudcontrol2001-dev: make it a cloudlb backend - https://phabricator.wikimedia.org/T336236 (jcrespo) I see what is going on- the backups are happening, but they return empty- which is a weird setup and we interpret as a failure (not intended). We...
[07:10:37] (CR) Elukey: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41178/console" [puppet] - https://gerrit.wikimedia.org/r/918409 (https://phabricator.wikimedia.org/T335756) (owner: Elukey)
[07:13:48] (PS1) Jcrespo: bacula: Ignore cloudcontrol2001-dev backup errors (no file backed up) [puppet] - https://gerrit.wikimedia.org/r/919781 (https://phabricator.wikimedia.org/T336236)
[07:16:22] (CR) Ayounsi: "Would it make sense to just ignore all the dev/test hosts from backups?" [puppet] - https://gerrit.wikimedia.org/r/919781 (https://phabricator.wikimedia.org/T336236) (owner: Jcrespo)
[07:19:05] (CR) Jelto: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/918594 (https://phabricator.wikimedia.org/T336301) (owner: Dzahn)
[07:20:25] (CR) Jcrespo: bacula: Ignore cloudcontrol2001-dev backup errors (no file backed up) (1 comment) [puppet] - https://gerrit.wikimedia.org/r/919781 (https://phabricator.wikimedia.org/T336236) (owner: Jcrespo)
[07:23:18] (CR) Jelto: [C: +2] microsites: change rewrite rule for https://transparency.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/918594 (https://phabricator.wikimedia.org/T336301) (owner: Dzahn)
[07:35:43] (PS1) Volans: CHANGELOG: add changelogs for release v7.1.0 [software/spicerack] - https://gerrit.wikimedia.org/r/919783
[07:39:48] (PS1) Elukey: Add discovery settings for k8s-ingress-ml-staging [dns] - https://gerrit.wikimedia.org/r/919785 (https://phabricator.wikimedia.org/T335756)
[07:40:34] (CR) CI reject: [V: -1] Add discovery settings for k8s-ingress-ml-staging [dns] - https://gerrit.wikimedia.org/r/919785 (https://phabricator.wikimedia.org/T335756) (owner: Elukey)
[07:40:45] (PS4) David Caro: d/changelog: prepare release 0.98 [software/tools-webservice] - https://gerrit.wikimedia.org/r/919356 (https://phabricator.wikimedia.org/T336507)
[07:40:49] (PS1) David Caro: Fix tests to adapt the latest toolforge-weld [software/tools-webservice] - https://gerrit.wikimedia.org/r/919786
[07:40:55] (PS1) David Caro: utils: allow specifying the version for bump_version [software/tools-webservice] - https://gerrit.wikimedia.org/r/919787
[07:41:29] (CR) David Caro: d/changelog: prepare release 0.98 (1 comment) [software/tools-webservice] - https://gerrit.wikimedia.org/r/919356 (https://phabricator.wikimedia.org/T336507) (owner: David Caro)
[07:42:04] (CR) Volans: [C: +2] CHANGELOG: add changelogs for release v7.1.0 [software/spicerack] - https://gerrit.wikimedia.org/r/919783 (owner: Volans)
[07:42:47] (PS2) Slyngshede: Search: add function for search users. [software/bitu-ldap] - https://gerrit.wikimedia.org/r/919073 (https://phabricator.wikimedia.org/T335476)
[07:43:52] (CR) CI reject: [V: -1] Search: add function for search users. [software/bitu-ldap] - https://gerrit.wikimedia.org/r/919073 (https://phabricator.wikimedia.org/T335476) (owner: Slyngshede)
[07:47:12] (Merged) jenkins-bot: CHANGELOG: add changelogs for release v7.1.0 [software/spicerack] - https://gerrit.wikimedia.org/r/919783 (owner: Volans)
[07:49:49] (PS1) Volans: Upstream release v7.1.0 [software/spicerack] (debian) - https://gerrit.wikimedia.org/r/919789
[07:49:57] (PS3) Slyngshede: Search: add function for search users. [software/bitu-ldap] - https://gerrit.wikimedia.org/r/919073 (https://phabricator.wikimedia.org/T335476)
[07:51:25] (CR) Filippo Giunchedi: [C: +2] webperf: Fix excimer_mysql_user typo [puppet] - https://gerrit.wikimedia.org/r/919422 (https://phabricator.wikimedia.org/T291015) (owner: Krinkle)
[07:54:29] (PS1) Stevemunene: Add stat1009 dummykeytabs [labs/private] - https://gerrit.wikimedia.org/r/919791 (https://phabricator.wikimedia.org/T336036)
[07:54:33] SRE, WMF-Legal, serviceops-collab, wikimediafoundation.org: Update redirect for transparency.wikimedia.org - https://phabricator.wikimedia.org/T336301 (Jelto) In progress→Resolved a:Dzahn Change merged, redirect points to https://wikimediafoundation.org/about/transparency/current/ no...
[07:57:04] (CR) Volans: [C: +2] Upstream release v7.1.0 [software/spicerack] (debian) - https://gerrit.wikimedia.org/r/919789 (owner: Volans)
[08:02:14] (Merged) jenkins-bot: Upstream release v7.1.0 [software/spicerack] (debian) - https://gerrit.wikimedia.org/r/919789 (owner: Volans)
[08:04:16] SRE, ops-codfw, DBA, database-backups: db2139 s4 (commonswiki) instance crashed (backup source) - https://phabricator.wikimedia.org/T335396 (jcrespo) Open→Resolved I've put db2139 back into service, no errors or issues observed so far. I reloaded all data from the most recent backup. Than...
[08:08:14] !log uploaded spicerack_7.1.0 to apt.wikimedia.org bullseye-wikimedia
[08:08:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:09:33] (CR) Stevemunene: [C: +2] Add stat1009 dummykeytabs [labs/private] - https://gerrit.wikimedia.org/r/919791 (https://phabricator.wikimedia.org/T336036) (owner: Stevemunene)
[08:09:44] (PS1) Volans: sre.hosts.provision: adapt call to DHCPConfMgmt [cookbooks] - https://gerrit.wikimedia.org/r/919792
[08:09:46] (PS1) Volans: sre.network.provision: add new cookbook [cookbooks] - https://gerrit.wikimedia.org/r/919793 (https://phabricator.wikimedia.org/T336485)
[08:09:52] (CR) Stevemunene: [V: +2 C: +2] Add stat1009 dummykeytabs [labs/private] - https://gerrit.wikimedia.org/r/919791 (https://phabricator.wikimedia.org/T336036) (owner: Stevemunene)
[08:11:30] SRE, SRE-tools, Infrastructure-Foundations, netops, Patch-For-Review: Setup zero touch provisioning (ZTP) for network devices - https://phabricator.wikimedia.org/T336485 (Volans)
[08:12:33] (CR) CI reject: [V: -1] sre.network.provision: add new cookbook [cookbooks] - https://gerrit.wikimedia.org/r/919793 (https://phabricator.wikimedia.org/T336485) (owner: Volans)
[08:16:12] (PS2) Volans: sre.network.provision: add new cookbook [cookbooks] - https://gerrit.wikimedia.org/r/919793 (https://phabricator.wikimedia.org/T336485)
[08:21:32] (CR) Ayounsi: [C: +1] sre.hosts.provision: adapt call to DHCPConfMgmt [cookbooks] - https://gerrit.wikimedia.org/r/919792 (owner: Volans)
[08:22:51] !log installed spicerack_7.1.0 on cumin2002
[08:22:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:26:00] (CR) Elukey: [V: +1 C: +2] service::catalog: set lvs_setup for k8s-ingress-ml-staging [puppet] - https://gerrit.wikimedia.org/r/918409 (https://phabricator.wikimedia.org/T335756) (owner: Elukey)
[08:26:06] !log installed spicerack_7.1.0 on cumin1001
[08:26:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:26:19] (CR) Volans: [C: +2] sre.hosts.provision: adapt call to DHCPConfMgmt [cookbooks] - https://gerrit.wikimedia.org/r/919792 (owner: Volans)
[08:26:44] !log restart pybal on lvs2010 and lvs2009 to pick up new LVS VIP for ml-staging k8s ingress - T335756
[08:26:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:26:47] T335756: Create a staging ingress configuration for ml-staging-codfw - https://phabricator.wikimedia.org/T335756
[08:28:37] (Merged) jenkins-bot: sre.hosts.provision: adapt call to DHCPConfMgmt [cookbooks] - https://gerrit.wikimedia.org/r/919792 (owner: Volans)
[08:30:00] (PS5) Filippo Giunchedi: prometheus: don't fail on unknown blackbox probe type [puppet] - https://gerrit.wikimedia.org/r/919331 (https://phabricator.wikimedia.org/T320620)
[08:30:01] pybal on 2010 looks good, proceeding with 2009
[08:32:44] aand lvs2009 done
[08:35:00] (NodeTextfileStale) firing: Stale textfile for testvm2005:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[08:37:34] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:39:26] SRE, ops-codfw, Patch-For-Review, cloud-services-team (FY2022/2023-Q4): cloudcontrol2001-dev: make it a cloudlb backend - https://phabricator.wikimedia.org/T336236 (aborrero) >>! In T336236#8849618, @jcrespo wrote: > I see what is going on- the backups are happening, but they return empty- which...
[08:39:40] (PS6) Filippo Giunchedi: prometheus: don't fail on unknown blackbox probe type [puppet] - https://gerrit.wikimedia.org/r/919331 (https://phabricator.wikimedia.org/T320620)
[08:40:21] (CR) Arturo Borrero Gonzalez: [C: +1] bacula: Ignore cloudcontrol2001-dev backup errors (no file backed up) (1 comment) [puppet] - https://gerrit.wikimedia.org/r/919781 (https://phabricator.wikimedia.org/T336236) (owner: Jcrespo)
[08:42:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:42:58] (PS1) Elukey: service::catalog: switch k8s-ingress-ml-staging to production [puppet] - https://gerrit.wikimedia.org/r/919795 (https://phabricator.wikimedia.org/T335756)
[08:45:12] !log jelto@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[08:45:25] !log jelto@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[08:47:13] (CR) Jcrespo: [C: +2] bacula: Ignore cloudcontrol2001-dev backup errors (no file backed up) (1 comment) [puppet] - https://gerrit.wikimedia.org/r/919781 (https://phabricator.wikimedia.org/T336236) (owner: Jcrespo)
[08:47:22] (CR) Filippo Giunchedi: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41180/console" [puppet] - https://gerrit.wikimedia.org/r/919331 (https://phabricator.wikimedia.org/T320620) (owner: Filippo Giunchedi)
[08:49:13] (CR) Ayounsi: sre.network.provision: add new cookbook (5 comments) [cookbooks] - https://gerrit.wikimedia.org/r/919793 (https://phabricator.wikimedia.org/T336485) (owner: Volans)
[08:50:12] (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41181/console" [puppet] - https://gerrit.wikimedia.org/r/919795 (https://phabricator.wikimedia.org/T335756) (owner: Elukey)
[08:53:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:54:12] SRE, Data-Engineering, Data-Platform-SRE, LDAP-Access-Requests, and 2 others: Grant temporary access to web based Data Engineering tools to Bishop Fox - https://phabricator.wikimedia.org/T336357 (BTullis) I have now been provided with the following two Wikitech accounts for the two users: * Thom...
[08:57:29] (CR) David Caro: [C: +2] P:toolforge::proxy: uninstall toolsweblogster [puppet] - https://gerrit.wikimedia.org/r/917923 (owner: Majavah)
[08:58:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:01:12] (CR) Klausman: [C: +1] service::catalog: switch k8s-ingress-ml-staging to production [puppet] - https://gerrit.wikimedia.org/r/919795 (https://phabricator.wikimedia.org/T335756) (owner: Elukey)
[09:05:03] (CR) Filippo Giunchedi: [V: +1] prometheus: don't fail on unknown blackbox probe type (1 comment) [puppet] - https://gerrit.wikimedia.org/r/919331 (https://phabricator.wikimedia.org/T320620) (owner: Filippo Giunchedi)
[09:05:18] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/thumbor: sync
[09:06:32] (CR) Elukey: [V: +1 C: +2] service::catalog: switch k8s-ingress-ml-staging to production [puppet] - https://gerrit.wikimedia.org/r/919795 (https://phabricator.wikimedia.org/T335756) (owner: Elukey)
[09:08:42] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/thumbor: sync
[09:11:20] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es1020.eqiad.wmnet with reason: Maintenance
[09:11:33] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es1020.eqiad.wmnet with reason: Maintenance
[09:11:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling es1020 (T335845)', diff saved to https://phabricator.wikimedia.org/P48217 and previous config saved to /var/cache/conftool/dbconfig/20230515-091139-ladsgroup.json
[09:12:35] (CR) Filippo Giunchedi: [C: +1] "LGTM!" [alerts] - https://gerrit.wikimedia.org/r/918547 (https://phabricator.wikimedia.org/T320620) (owner: Cwhite)
[09:12:57] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[09:13:03] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[09:14:23] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.284 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[09:14:29] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49993 bytes in 0.100 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[09:14:33] RECOVERY - Router interfaces on cr2-drmrs is OK: OK: host 185.15.58.129, interfaces up: 62, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:14:56] (CR) Volans: "Thanks for the quick review, addressed comments." [cookbooks] - https://gerrit.wikimedia.org/r/919793 (https://phabricator.wikimedia.org/T336485) (owner: Volans)
[09:14:57] RECOVERY - Router interfaces on cr3-esams is OK: OK: host 91.198.174.245, interfaces up: 84, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:14:59] SRE, Platform Engineering, Release Pipeline, serviceops, and 4 others: Kask functional testing with Cassandra via the Deployment Pipeline - https://phabricator.wikimedia.org/T224041 (hashar) I have filed T335780 with the goal of updating the `docker.wikimedia.org/releng/cassandra311` image which...
[09:16:07] (PS3) Volans: sre.network.provision: add new cookbook [cookbooks] - https://gerrit.wikimedia.org/r/919793 (https://phabricator.wikimedia.org/T336485)
[09:25:00] (CR) Elukey: "recheck" [dns] - https://gerrit.wikimedia.org/r/919785 (https://phabricator.wikimedia.org/T335756) (owner: Elukey)
[09:27:36] SRE, Release-Engineering-Team, ci-test-error: tox-docker CI test doesn't pick up overrides for pylint - https://phabricator.wikimedia.org/T281347 (hashar) Open→Resolved After 2 years, I am assuming this one got solved since some change got merged, we upgraded pip/tox etc since then. See also...
[09:28:11] (CR) Ayounsi: [C: +1] "Not tested but overall lgtm." [cookbooks] - https://gerrit.wikimedia.org/r/919793 (https://phabricator.wikimedia.org/T336485) (owner: Volans)
[09:28:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1020 (T335845)', diff saved to https://phabricator.wikimedia.org/P48218 and previous config saved to /var/cache/conftool/dbconfig/20230515-092810-ladsgroup.json
[09:28:34] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (PUT endpoints) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:33:34] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (PUT endpoints) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:37:20] SRE, Traffic, Wikidata, wdwb-tech, wmde-wikidata-tech: Wikidata seems to still be utilizing insecure HTTP URIs - https://phabricator.wikimedia.org/T331356 (Ladsgroup) Until it gets changed to HTTPS, basically we have two options: - Remove the link from sidebar and add it as text field in act...
[09:38:54] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 15802
[09:39:37] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 15802
[09:39:54] (PS1) Majavah: varnishkafka: remove absented logster integration [puppet] - https://gerrit.wikimedia.org/r/919802
[09:39:56] (PS1) Majavah: P:toolforge::proxy: remove absented logster resources [puppet] - https://gerrit.wikimedia.org/r/919803
[09:39:58] (PS1) Majavah: logster: remove classes [puppet] - https://gerrit.wikimedia.org/r/919804
[09:40:32] SRE, DBA, Data-Platform-SRE, Infrastructure-Foundations, and 9 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (MatthewVernon)
[09:43:12] (CR) Majavah: "PCC: https://puppet-compiler.wmflabs.org/output/919802/41182/, deployment-prep issues seem unrelated" [puppet] - https://gerrit.wikimedia.org/r/919802 (owner: Majavah)
[09:43:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1020', diff saved to https://phabricator.wikimedia.org/P48219 and previous config saved to /var/cache/conftool/dbconfig/20230515-094317-ladsgroup.json
[09:49:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depool db1123 T334910', diff saved to https://phabricator.wikimedia.org/P48220 and previous config saved to /var/cache/conftool/dbconfig/20230515-094938-ladsgroup.json
[09:49:43] T334910: decommission db1123.eqiad.wmnet - https://phabricator.wikimedia.org/T334910
[09:50:11] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 124 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[09:51:56] SRE, Traffic, serviceops, Platform Team Initiatives (API Gateway): Create Benthos image - https://phabricator.wikimedia.org/T336658 (kamila)
[09:52:02] (PS1) Ladsgroup: conftool-data: Remove db1123 [puppet] - https://gerrit.wikimedia.org/r/919805 (https://phabricator.wikimedia.org/T334910)
[09:52:19] SRE, Traffic, serviceops, Platform Team Initiatives (API Gateway): Create Benthos image - https://phabricator.wikimedia.org/T336658 (kamila) Open→In progress
[09:52:23] SRE, Traffic, serviceops, Platform Team Initiatives (API Gateway): Handle edge cache invalidation for the api gateway - https://phabricator.wikimedia.org/T324200 (kamila)
[09:53:12] (CR) Ladsgroup: [C: +2] conftool-data: Remove db1123 [puppet] - https://gerrit.wikimedia.org/r/919805 (https://phabricator.wikimedia.org/T334910) (owner: Ladsgroup)
[09:53:36] SRE, Traffic, serviceops, Platform Team Initiatives (API Gateway): Create Benthos docker image - https://phabricator.wikimedia.org/T336658 (kamila)
[09:54:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Remove db1123 from dbctl T334910', diff saved to https://phabricator.wikimedia.org/P48221 and previous config saved to /var/cache/conftool/dbconfig/20230515-095412-ladsgroup.json
[09:56:24] (PS1) Ladsgroup: Remove db1123 [puppet] - https://gerrit.wikimedia.org/r/919807 (https://phabricator.wikimedia.org/T334910)
[09:58:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1020', diff saved to https://phabricator.wikimedia.org/P48222 and previous config saved to /var/cache/conftool/dbconfig/20230515-095823-ladsgroup.json
[10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230515T1000)
[10:01:16] (PS2) Elukey: Add discovery settings for k8s-ingress-ml-staging [dns] - https://gerrit.wikimedia.org/r/919785 (https://phabricator.wikimedia.org/T335756)
[10:02:12] (PS1) Hnowlan: admin_ng, thumbor: double memory limit for namespace and pods [deployment-charts] - https://gerrit.wikimedia.org/r/919808 (https://phabricator.wikimedia.org/T334488)
[10:02:48] (CR) Filippo Giunchedi: "Untested but overall LGTM, see inline" [puppet] - https://gerrit.wikimedia.org/r/919053 (https://phabricator.wikimedia.org/T271822) (owner: Giuseppe Lavagetto)
[10:05:49] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.decommission for hosts db1123.eqiad.wmnet
[10:11:01] PROBLEM - Check systemd state on db2184 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-mysqld-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:11:03] !log ladsgroup@cumin1001 START - Cookbook sre.dns.netbox
[10:13:01] (NodeTextfileStale) firing: Stale textfile for bast2003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[10:13:08] !log ladsgroup@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1123.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - ladsgroup@cumin1001"
[10:13:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1020 (T335845)', diff saved to https://phabricator.wikimedia.org/P48223 and previous config saved to /var/cache/conftool/dbconfig/20230515-101329-ladsgroup.json
[10:14:19] (CR) Kamila Součková: [C: +1] admin_ng, thumbor: double memory limit for namespace and pods [deployment-charts] - https://gerrit.wikimedia.org/r/919808 (https://phabricator.wikimedia.org/T334488) (owner: Hnowlan)
[10:15:28] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1123.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - ladsgroup@cumin1001"
[10:15:28] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[10:15:29] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db1123.eqiad.wmnet
[10:18:43] (CR) Ladsgroup: [C: +2] Remove db1123 [puppet] - https://gerrit.wikimedia.org/r/919807 (https://phabricator.wikimedia.org/T334910) (owner: Ladsgroup)
[10:19:34] !log Removing db1123 from zarcillo T334910
[10:19:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:19:38] T334910: decommission db1123.eqiad.wmnet - https://phabricator.wikimedia.org/T334910
[10:20:19] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es1023.eqiad.wmnet with reason: Maintenance
[10:20:32] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es1023.eqiad.wmnet with reason: Maintenance
[10:20:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling es1023 (T335845)', diff saved to https://phabricator.wikimedia.org/P48224 and previous config saved to /var/cache/conftool/dbconfig/20230515-102038-ladsgroup.json
[10:21:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:24:52] (CR) Ayounsi: Validators: improve device name, add interface/outlet (3 comments) [software/netbox-extras] - https://gerrit.wikimedia.org/r/917876 (https://phabricator.wikimedia.org/T310590) (owner: Ayounsi)
[10:25:19] (PS5) Ayounsi: Validators: improve device name, add interface/outlet [software/netbox-extras] - https://gerrit.wikimedia.org/r/917876 (https://phabricator.wikimedia.org/T310590)
[10:26:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All -
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:29:18] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:31:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1023 (T335845)', diff saved to https://phabricator.wikimedia.org/P48225 and previous config saved to /var/cache/conftool/dbconfig/20230515-103105-ladsgroup.json [10:31:26] (03CR) 10JMeybohm: prometheus/k8s: add selective scraping of ports (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/919053 (https://phabricator.wikimedia.org/T271822) (owner: 10Giuseppe Lavagetto) [10:34:18] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:36:02] (03CR) 10JMeybohm: prometheus/k8s: add selective scraping of ports (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/919053 (https://phabricator.wikimedia.org/T271822) (owner: 10Giuseppe Lavagetto) [10:39:52] 10ops-eqiad, 10DBA, 10decommission-hardware: decommission db1123.eqiad.wmnet - https://phabricator.wikimedia.org/T334910 (10Ladsgroup) a:05Ladsgroup→03wiki_willy [10:46:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1023', diff saved to https://phabricator.wikimedia.org/P48226 and previous config saved to /var/cache/conftool/dbconfig/20230515-104611-ladsgroup.json [10:56:41] (03PS2) 10Giuseppe Lavagetto: prometheus/k8s: add selective scraping of ports in staging [puppet] - 10https://gerrit.wikimedia.org/r/919053 (https://phabricator.wikimedia.org/T271822) [10:56:53] (03CR) 10Giuseppe Lavagetto: 
prometheus/k8s: add selective scraping of ports in staging (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/919053 (https://phabricator.wikimedia.org/T271822) (owner: 10Giuseppe Lavagetto) [10:58:56] (03CR) 10CI reject: [V: 04-1] prometheus/k8s: add selective scraping of ports in staging [puppet] - 10https://gerrit.wikimedia.org/r/919053 (https://phabricator.wikimedia.org/T271822) (owner: 10Giuseppe Lavagetto) [11:01:05] (03PS6) 10JMeybohm: Run envoy config validation against scaffold [deployment-charts] - 10https://gerrit.wikimedia.org/r/915593 (https://phabricator.wikimedia.org/T300324) [11:01:15] (03PS9) 10JMeybohm: Refactor envoy access_log_path to access loggers [deployment-charts] - 10https://gerrit.wikimedia.org/r/915458 (https://phabricator.wikimedia.org/T300324) [11:01:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1023', diff saved to https://phabricator.wikimedia.org/P48227 and previous config saved to /var/cache/conftool/dbconfig/20230515-110118-ladsgroup.json [11:01:20] (03PS5) 10JMeybohm: mesh/configuration: Refactor common_http_protocol_options [deployment-charts] - 10https://gerrit.wikimedia.org/r/915672 (https://phabricator.wikimedia.org/T300324) [11:02:28] (03PS3) 10Giuseppe Lavagetto: prometheus/k8s: add selective scraping of ports in staging [puppet] - 10https://gerrit.wikimedia.org/r/919053 (https://phabricator.wikimedia.org/T271822) [11:03:19] (03CR) 10JMeybohm: [C: 03+2] Allow basic validation of envoy config in CI [puppet] - 10https://gerrit.wikimedia.org/r/915745 (https://phabricator.wikimedia.org/T304660) (owner: 10JMeybohm) [11:03:21] (03CR) 10JMeybohm: [C: 03+2] envoyproxy: Fix most validation errors in the `good` build_envoy_config tests [puppet] - 10https://gerrit.wikimedia.org/r/773642 (https://phabricator.wikimedia.org/T304660) (owner: 10RLazarus) [11:16:18] (03PS1) 10Btullis: Add two ldap_only users from bishopfox [puppet] - 10https://gerrit.wikimedia.org/r/919822 
(https://phabricator.wikimedia.org/T336357) [11:16:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1023 (T335845)', diff saved to https://phabricator.wikimedia.org/P48228 and previous config saved to /var/cache/conftool/dbconfig/20230515-111624-ladsgroup.json [11:19:04] (03CR) 10Btullis: "This is to replace: https://gerrit.wikimedia.org/r/c/operations/puppet/+/918519" [puppet] - 10https://gerrit.wikimedia.org/r/919822 (https://phabricator.wikimedia.org/T336357) (owner: 10Btullis) [11:24:52] (03CR) 10Majavah: [C: 03+1] "see inline" [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/918610 (https://phabricator.wikimedia.org/T258841) (owner: 10BryanDavis) [11:25:11] (03Abandoned) 10Majavah: P:base::firewall: add non-etcd way to reject traffic [puppet] - 10https://gerrit.wikimedia.org/r/918408 (owner: 10Majavah) [11:26:58] (03PS1) 10Stevemunene: Bring stat1009 into service [puppet] - 10https://gerrit.wikimedia.org/r/919826 (https://phabricator.wikimedia.org/T336036) [11:43:45] RECOVERY - Checks that the airflow database for airflow analytics_product is working properly on an-airflow1006 is OK: OK: /usr/bin/env AIRFLOW_HOME=/srv/airflow-analytics_product /usr/lib/airflow/bin/airflow db check succeeded https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow [11:44:13] RECOVERY - Checks that the local airflow scheduler for airflow @analytics_product is working properly on an-airflow1006 is OK: OK: /usr/bin/env AIRFLOW_HOME=/srv/airflow-analytics_product /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1006.eqiad.wmnet succeeded https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow [11:45:45] (03CR) 10Volans: "replies inline" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/917876 (https://phabricator.wikimedia.org/T310590) (owner: 10Ayounsi) [12:13:57] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM Filippo thanks for the help here! 
I'm happy to roll it out with current warning level and see how we get on?" [alerts] - 10https://gerrit.wikimedia.org/r/915489 (https://phabricator.wikimedia.org/T335350) (owner: 10Cathal Mooney) [12:16:23] (03PS1) 10Jaime Nuche: doc: add rsync password [labs/private] - 10https://gerrit.wikimedia.org/r/919833 (https://phabricator.wikimedia.org/T336168) [12:16:48] (03PS6) 10Ayounsi: Validators: improve device name, add interface/outlet [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/917876 (https://phabricator.wikimedia.org/T310590) [12:16:50] (03PS1) 10Jaime Nuche: doc: temporary config for docs publishing from gitlab [puppet] - 10https://gerrit.wikimedia.org/r/919834 (https://phabricator.wikimedia.org/T336168) [12:18:31] (03CR) 10Ayounsi: Validators: improve device name, add interface/outlet (032 comments) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/917876 (https://phabricator.wikimedia.org/T310590) (owner: 10Ayounsi) [12:19:49] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM! Verification can happen at https://prometheus-eqiad.wikimedia.org/k8s-staging/" [puppet] - 10https://gerrit.wikimedia.org/r/919053 (https://phabricator.wikimedia.org/T271822) (owner: 10Giuseppe Lavagetto) [12:23:05] (03CR) 10Filippo Giunchedi: [C: 03+1] "Agreed, let's ship it!" 
[alerts] - 10https://gerrit.wikimedia.org/r/915489 (https://phabricator.wikimedia.org/T335350) (owner: 10Cathal Mooney) [12:27:53] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 15802 [12:28:46] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 15802 [12:35:01] (NodeTextfileStale) firing: Stale textfile for testvm2005:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [12:35:15] 10SRE, 10DBA, 10Data-Platform-SRE, 10Infrastructure-Foundations, and 9 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (10MatthewVernon) [12:35:50] (03PS2) 10Jgreen: Add frav1003 dns and rdns entries [dns] - 10https://gerrit.wikimedia.org/r/919239 (https://phabricator.wikimedia.org/T334400) (owner: 10Dwisehaupt) [12:39:50] (03CR) 10Jgreen: [C: 03+2] Add frav1003 dns and rdns entries [dns] - 10https://gerrit.wikimedia.org/r/919239 (https://phabricator.wikimedia.org/T334400) (owner: 10Dwisehaupt) [12:40:07] 10SRE, 10Data-Engineering, 10Shared-Data-Infrastructure, 10User-MoritzMuehlenhoff: Hadoop MapReduce port range cannot be configured to a fixed range - https://phabricator.wikimedia.org/T111433 (10Ottomata) This is probably Data Eng SRE responsibility! Tagged. 
[12:40:13] 10SRE, 10Data-Engineering, 10Shared-Data-Infrastructure, 10User-MoritzMuehlenhoff: Hadoop MapReduce port range cannot be configured to a fixed range - https://phabricator.wikimedia.org/T111433 (10Ottomata) [12:41:11] (03CR) 10Jgreen: [C: 03+2] Add donorpreferences, delete dash CNAMEs [dns] - 10https://gerrit.wikimedia.org/r/919423 (https://phabricator.wikimedia.org/T335793) (owner: 10Dwisehaupt) [12:41:20] (03PS2) 10Jgreen: Add donorpreferences, delete dash CNAMEs [dns] - 10https://gerrit.wikimedia.org/r/919423 (https://phabricator.wikimedia.org/T335793) (owner: 10Dwisehaupt) [12:41:30] (03CR) 10Jgreen: [V: 03+2] Add donorpreferences, delete dash CNAMEs [dns] - 10https://gerrit.wikimedia.org/r/919423 (https://phabricator.wikimedia.org/T335793) (owner: 10Dwisehaupt) [12:41:45] 10SRE, 10DBA, 10Data-Platform-SRE, 10Infrastructure-Foundations, and 9 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (10Jelto) [12:42:32] (03CR) 10Jgreen: [C: 03+2] Add donorpreferences, delete dash CNAMEs [dns] - 10https://gerrit.wikimedia.org/r/919423 (https://phabricator.wikimedia.org/T335793) (owner: 10Dwisehaupt) [12:49:40] 10SRE, 10Data-Engineering, 10Data-Platform-SRE, 10LDAP-Access-Requests, and 4 others: Grant temporary access to web based Data Engineering tools to Bishop Fox - https://phabricator.wikimedia.org/T336357 (10JArguello-WMF) [12:51:13] 10SRE, 10Traffic, 10serviceops, 10Platform Team Initiatives (API Gateway): Handle edge cache invalidation for the api gateway - https://phabricator.wikimedia.org/T324200 (10Ottomata) > Basically add a configuration that allows to figure out article_url => derived urls as a mapping, and eliminate all the n... 
[12:53:06] (CirrusSearchJobQueueBacklogTooBig) firing: CirrusSearch job topic eqiad.mediawiki.job.cirrusSearchLinksUpdate is heavily backlogged with 242k messages - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=eqiad%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueBacklogTooBig [12:55:01] (03CR) 10JMeybohm: [C: 03+2] mesh/configuration: Refactor common_http_protocol_options [deployment-charts] - 10https://gerrit.wikimedia.org/r/915672 (https://phabricator.wikimedia.org/T300324) (owner: 10JMeybohm) [12:55:03] (03CR) 10JMeybohm: [C: 03+2] Refactor envoy access_log_path to access loggers [deployment-charts] - 10https://gerrit.wikimedia.org/r/915458 (https://phabricator.wikimedia.org/T300324) (owner: 10JMeybohm) [12:55:06] (03CR) 10JMeybohm: [C: 03+2] Run envoy config validation against scaffold [deployment-charts] - 10https://gerrit.wikimedia.org/r/915593 (https://phabricator.wikimedia.org/T300324) (owner: 10JMeybohm) [12:55:48] (03Merged) 10jenkins-bot: Run envoy config validation against scaffold [deployment-charts] - 10https://gerrit.wikimedia.org/r/915593 (https://phabricator.wikimedia.org/T300324) (owner: 10JMeybohm) [12:55:50] (03Merged) 10jenkins-bot: Refactor envoy access_log_path to access loggers [deployment-charts] - 10https://gerrit.wikimedia.org/r/915458 (https://phabricator.wikimedia.org/T300324) (owner: 10JMeybohm) [12:55:53] (03Merged) 10jenkins-bot: mesh/configuration: Refactor common_http_protocol_options [deployment-charts] - 10https://gerrit.wikimedia.org/r/915672 (https://phabricator.wikimedia.org/T300324) (owner: 10JMeybohm) [12:58:23] !log jelto@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [12:58:54] !log jelto@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. 
[12:59:19] (03PS1) 10Daimona Eaytoy: Configure logging for the CampaignEvents channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/919838 (https://phabricator.wikimedia.org/T320434) [13:00:04] !log jelto@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: #bothumor My software never has bugs. It just develops random features. Rise for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230515T1300). [13:00:05] No Gerrit patches in the queue for this window AFAICS. [13:00:31] !log jelto@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [13:03:06] (CirrusSearchJobQueueBacklogTooBig) resolved: CirrusSearch job topic eqiad.mediawiki.job.cirrusSearchLinksUpdate is heavily backlogged with 201.9k messages - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=eqiad%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueBacklogTooBig [13:07:17] (03CR) 10Ottomata: Add flink-app default log config and use it in page_content_change (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/917999 (https://phabricator.wikimedia.org/T335802) (owner: 10TChin) [13:07:49] (03PS3) 10Elukey: Add discovery settings for k8s-ingress-ml-staging [dns] - 10https://gerrit.wikimedia.org/r/919785 (https://phabricator.wikimedia.org/T335756) [13:09:31] (03CR) 10JMeybohm: [C: 03+1] Add discovery settings for k8s-ingress-ml-staging [dns] - 10https://gerrit.wikimedia.org/r/919785 (https://phabricator.wikimedia.org/T335756) (owner: 10Elukey) [13:10:07] (03CR) 10Elukey: [C: 03+2] Add discovery settings for k8s-ingress-ml-staging [dns] - 10https://gerrit.wikimedia.org/r/919785 (https://phabricator.wikimedia.org/T335756) (owner: 10Elukey) [13:22:22] (03CR) 10David Caro: [C: 03+2] Fix tests to adapt the latest toolforge-weld [software/tools-webservice] - 
10https://gerrit.wikimedia.org/r/919786 (owner: 10David Caro) [13:22:26] (03CR) 10David Caro: [C: 03+2] utils: allow specifying the version for bump_version [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/919787 (owner: 10David Caro) [13:22:30] (03CR) 10David Caro: [C: 03+2] d/changelog: prepare release 0.98 [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/919356 (https://phabricator.wikimedia.org/T336507) (owner: 10David Caro) [13:23:23] (03Merged) 10jenkins-bot: Fix tests to adapt the latest toolforge-weld [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/919786 (owner: 10David Caro) [13:23:25] (03Merged) 10jenkins-bot: utils: allow specifying the version for bump_version [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/919787 (owner: 10David Caro) [13:23:54] (03PS2) 10Majavah: ferm::service: allow passing array of hosts [puppet] - 10https://gerrit.wikimedia.org/r/919300 [13:24:11] (03Merged) 10jenkins-bot: d/changelog: prepare release 0.98 [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/919356 (https://phabricator.wikimedia.org/T336507) (owner: 10David Caro) [13:24:30] (03CR) 10CI reject: [V: 04-1] ferm::service: allow passing array of hosts [puppet] - 10https://gerrit.wikimedia.org/r/919300 (owner: 10Majavah) [13:24:41] 10SRE, 10Infrastructure-Foundations, 10Traffic: Reimaging cookbook not forcing a Puppet agent run on lvs2011, lvs2012 - https://phabricator.wikimedia.org/T336593 (10Volans) @ssingh I'm not sure I agree with your renaming. What do you mean exactly? :) The only race I can think of is that the removal from pup... 
[13:25:30] (03PS3) 10Majavah: ferm::service: allow passing array of hosts [puppet] - 10https://gerrit.wikimedia.org/r/919300 [13:26:59] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41185/console" [puppet] - 10https://gerrit.wikimedia.org/r/919300 (owner: 10Majavah) [13:28:21] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10Data-Persistence-Backup: db1225 crashed (CPU 1 machine check error detected) - https://phabricator.wikimedia.org/T336326 (10Jclark-ctr) 05Open→03Resolved Server has not had any errors since flea power drain closing ticket [13:32:14] (03CR) 10Volans: [C: 03+2] installserver: enable ZTP for network devices [puppet] - 10https://gerrit.wikimedia.org/r/919076 (https://phabricator.wikimedia.org/T336485) (owner: 10Volans) [13:32:32] (03CR) 10Volans: [C: 03+2] install_server: simplify DHCP config [puppet] - 10https://gerrit.wikimedia.org/r/919276 (https://phabricator.wikimedia.org/T336485) (owner: 10Volans) [13:32:43] (03PS8) 10Volans: install_server: simplify DHCP config [puppet] - 10https://gerrit.wikimedia.org/r/919276 (https://phabricator.wikimedia.org/T336485) [13:33:24] !log disabling puppet on the install hosts to deploy changes for T336485 [13:33:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:29] T336485: Setup zero touch provisioning (ZTP) for network devices - https://phabricator.wikimedia.org/T336485 [13:34:32] (03PS14) 10Volans: install_server: convert dhcpd.conf to template [puppet] - 10https://gerrit.wikimedia.org/r/919277 (https://phabricator.wikimedia.org/T336485) [13:36:08] (03CR) 10Volans: [C: 03+2] install_server: convert dhcpd.conf to template [puppet] - 10https://gerrit.wikimedia.org/r/919277 (https://phabricator.wikimedia.org/T336485) (owner: 10Volans) [13:36:26] (03PS6) 10Volans: install_server: remove mgmt subnet already managed [puppet] - 10https://gerrit.wikimedia.org/r/919282 (https://phabricator.wikimedia.org/T336485) 
[13:37:40] (03CR) 10Volans: [C: 03+2] install_server: remove mgmt subnet already managed [puppet] - 10https://gerrit.wikimedia.org/r/919282 (https://phabricator.wikimedia.org/T336485) (owner: 10Volans) [13:38:58] 10SRE, 10Infrastructure-Foundations, 10Traffic: Reimaging cookbook not forcing a Puppet agent run on lvs2011, lvs2012 - https://phabricator.wikimedia.org/T336593 (10ssingh) >>! In T336593#8850410, @Volans wrote: > @ssingh I'm not sure I agree with your renaming. What do you mean exactly? :) "Reimaging cookb... [13:42:09] (03PS1) 10Klausman: hiera/k8s/ml: add namespace permissions for revertrisk [puppet] - 10https://gerrit.wikimedia.org/r/919842 (https://phabricator.wikimedia.org/T333124) [13:42:50] 10SRE, 10Data-Engineering, 10Data-Platform-SRE, 10Shared-Data-Infrastructure, 10User-MoritzMuehlenhoff: Hadoop MapReduce port range cannot be configured to a fixed range - https://phabricator.wikimedia.org/T111433 (10BTullis) a:03BTullis I'll have a look at this one. [13:43:28] (03PS1) 10Klausman: k8s/ml/prod: Add revertrisk namespace and permissions, plus TLS config [deployment-charts] - 10https://gerrit.wikimedia.org/r/919843 (https://phabricator.wikimedia.org/T333124) [13:45:41] (03CR) 10Btullis: [C: 03+2] Add two ldap_only users from bishopfox [puppet] - 10https://gerrit.wikimedia.org/r/919822 (https://phabricator.wikimedia.org/T336357) (owner: 10Btullis) [13:45:58] !log volans@cumin1001 START - Cookbook sre.hosts.reimage for host sretest1001.eqiad.wmnet with OS bullseye [13:46:18] (03PS1) 10Klausman: hiera/k8s/ml: Add dummy data for revertrisk user [labs/private] - 10https://gerrit.wikimedia.org/r/919845 [13:46:24] (03PS1) 10Andrew Bogott: galera: greatly increase max connections [puppet] - 10https://gerrit.wikimedia.org/r/919846 [13:46:30] (03CR) 10Jelto: [C: 03+1] "looks good to me. 
But let's get a second opinion from ServiceOps" [puppet] - 10https://gerrit.wikimedia.org/r/919350 (https://phabricator.wikimedia.org/T322579) (owner: 10EoghanGaffney) [13:47:18] (03CR) 10Jelto: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/919351 (https://phabricator.wikimedia.org/T322579) (owner: 10EoghanGaffney) [13:47:25] (03PS2) 10Klausman: hiera/k8s/ml: Add dummy data for revertrisk user [labs/private] - 10https://gerrit.wikimedia.org/r/919845 [13:48:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:48:56] (03CR) 10Majavah: [C: 03+1] galera: greatly increase max connections [puppet] - 10https://gerrit.wikimedia.org/r/919846 (owner: 10Andrew Bogott) [13:49:10] (03CR) 10Andrew Bogott: [C: 03+2] galera: greatly increase max connections [puppet] - 10https://gerrit.wikimedia.org/r/919846 (owner: 10Andrew Bogott) [13:49:19] 10SRE, 10DBA, 10Data-Platform-SRE, 10Infrastructure-Foundations, and 9 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (10ssingh) [13:50:13] (03CR) 10Jelto: [C: 03+2] gitlab: make sure letsencrypt extention is disabled [puppet] - 10https://gerrit.wikimedia.org/r/919022 (https://phabricator.wikimedia.org/T336476) (owner: 10Jelto) [13:51:38] (03PS1) 10Ssingh: hiera: temporarily remove dns2002 from authdns_servers [puppet] - 10https://gerrit.wikimedia.org/r/919847 (https://phabricator.wikimedia.org/T335042) [13:53:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:53:17] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Setup zero touch provisioning (ZTP) for network devices - https://phabricator.wikimedia.org/T336485 (10Volans) I've tested a reimage of a physical host and worked fine, we still have a bit of duplication of requests, do you s... [13:53:50] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Setup zero touch provisioning (ZTP) for network devices - https://phabricator.wikimedia.org/T336485 (10Volans) [13:54:14] (03CR) 10Volans: [C: 03+2] sre.network.provision: add new cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/919793 (https://phabricator.wikimedia.org/T336485) (owner: 10Volans) [13:56:47] (03CR) 10Elukey: [C: 03+1] hiera/k8s/ml: Add dummy data for revertrisk user (031 comment) [labs/private] - 10https://gerrit.wikimedia.org/r/919845 (owner: 10Klausman) [13:56:58] !log re-enabled puppet on the install hosts to deploy changes for T336485 [13:57:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:02] T336485: Setup zero touch provisioning (ZTP) for network devices - https://phabricator.wikimedia.org/T336485 [13:57:12] (03Merged) 10jenkins-bot: sre.network.provision: add new cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/919793 (https://phabricator.wikimedia.org/T336485) (owner: 10Volans) [13:57:39] 10SRE, 10Traffic, 10serviceops, 10Platform Team Initiatives (API Gateway): Handle edge cache invalidation for the api gateway - https://phabricator.wikimedia.org/T324200 (10daniel) >>! In T324200#8850266, @Ottomata wrote: >> Basically add a configuration that allows to figure out article_url => derived ur... 
[13:57:57] (03CR) 10Elukey: [C: 03+1] hiera/k8s/ml: add namespace permissions for revertrisk (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/919842 (https://phabricator.wikimedia.org/T333124) (owner: 10Klausman) [13:57:59] 10SRE, 10Data-Engineering, 10Data-Platform-SRE, 10LDAP-Access-Requests, and 4 others: Grant temporary access to web based Data Engineering tools to Bishop Fox - https://phabricator.wikimedia.org/T336357 (10BTullis) I have now added `uid=twilsonbf` and `uid=ryan-bf` to the `nda` group in LDAP. ` btullis@mwm... [13:58:17] 10SRE, 10Data-Engineering, 10Data-Platform-SRE, 10LDAP-Access-Requests, and 4 others: Grant temporary access to web based Data Engineering tools to Bishop Fox - https://phabricator.wikimedia.org/T336357 (10BTullis) [13:58:26] (03PS3) 10Klausman: role::mlserve: Add dummy data for revertrisk user [labs/private] - 10https://gerrit.wikimedia.org/r/919845 [13:58:40] (03CR) 10Klausman: role::mlserve: Add dummy data for revertrisk user (031 comment) [labs/private] - 10https://gerrit.wikimedia.org/r/919845 (owner: 10Klausman) [13:58:53] (03CR) 10Elukey: k8s/ml/prod: Add revertrisk namespace and permissions, plus TLS config (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/919843 (https://phabricator.wikimedia.org/T333124) (owner: 10Klausman) [14:00:06] !log volans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1001.eqiad.wmnet with reason: host reimage [14:00:11] (03PS2) 10Klausman: admin_ng/ml-serve: Add revertrisk namespace and permissions, plus TLS config [deployment-charts] - 10https://gerrit.wikimedia.org/r/919843 (https://phabricator.wikimedia.org/T333124) [14:00:44] (03PS2) 10Klausman: admin_ng/ml-serve: add namespace permissions for revertrisk [puppet] - 10https://gerrit.wikimedia.org/r/919842 (https://phabricator.wikimedia.org/T333124) [14:00:47] (03CR) 10Klausman: admin_ng/ml-serve: add namespace permissions for revertrisk (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/919842 (https://phabricator.wikimedia.org/T333124) (owner: 10Klausman) [14:00:53] (03CR) 10Klausman: admin_ng/ml-serve: Add revertrisk namespace and permissions, plus TLS config (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/919843 (https://phabricator.wikimedia.org/T333124) (owner: 10Klausman) [14:01:12] (03CR) 10Elukey: admin_ng/ml-serve: Add revertrisk namespace and permissions, plus TLS config (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/919843 (https://phabricator.wikimedia.org/T333124) (owner: 10Klausman) [14:01:57] (03PS3) 10Klausman: admin_ng: add revertrisk namespace and config to ml-serve clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/919843 (https://phabricator.wikimedia.org/T333124) [14:02:01] (03CR) 10Klausman: admin_ng: add revertrisk namespace and config to ml-serve clusters (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/919843 (https://phabricator.wikimedia.org/T333124) (owner: 10Klausman) [14:03:16] !log volans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1001.eqiad.wmnet with reason: host reimage [14:03:18] 10SRE, 10ops-eqiad, 10DC-Ops, 10Product-Platform Operations: Q3:rack/setup/install snapshot1016 & snapshot1017 - https://phabricator.wikimedia.org/T334955 (10Jclark-ctr) [14:03:25] (03CR) 10Elukey: [C: 03+1] admin_ng: add revertrisk namespace and config to ml-serve clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/919843 (https://phabricator.wikimedia.org/T333124) (owner: 10Klausman) [14:03:28] 10SRE, 10ops-eqiad, 10DC-Ops, 10Product-Platform Operations: Q3:rack/setup/install snapshot1016 & snapshot1017 - https://phabricator.wikimedia.org/T334955 (10Jclark-ctr) snapshot1016 C6 U26 port 24 cableid 3230 snapshot1017 D6 U39. port 39 Cableid. 
23000051 [14:06:23] 10SRE, 10ops-eqiad, 10DBA, 10decommission-hardware: decommission db1110.eqiad.wmnet - https://phabricator.wikimedia.org/T335011 (10Jclark-ctr) [14:06:31] 10SRE, 10ops-eqiad, 10DBA, 10decommission-hardware: decommission db1110.eqiad.wmnet - https://phabricator.wikimedia.org/T335011 (10Jclark-ctr) 05Stalled→03Resolved [14:06:35] (03CR) 10Klausman: [C: 03+2] role::mlserve: Add dummy data for revertrisk user [labs/private] - 10https://gerrit.wikimedia.org/r/919845 (owner: 10Klausman) [14:06:39] (03CR) 10Klausman: [V: 03+2 C: 03+2] role::mlserve: Add dummy data for revertrisk user [labs/private] - 10https://gerrit.wikimedia.org/r/919845 (owner: 10Klausman) [14:06:41] (03PS1) 10JMeybohm: Update charts from mesh.configuration 1.2.0 to 1.2.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/919848 (https://phabricator.wikimedia.org/T300324) [14:06:43] (03PS1) 10JMeybohm: Update charts from mesh.configuration 1.1 to 1.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/919849 (https://phabricator.wikimedia.org/T300324) [14:07:27] (03CR) 10CI reject: [V: 04-1] Update charts from mesh.configuration 1.1 to 1.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/919849 (https://phabricator.wikimedia.org/T300324) (owner: 10JMeybohm) [14:08:29] 10SRE, 10Thumbor, 10Thumbor Migration, 10serviceops, 10Platform Team Workboards (Platform Engineering Reliability): Improve Thumbor's memcached infrastructure - https://phabricator.wikimedia.org/T318695 (10jijiki) [14:08:32] 10SRE, 10Infrastructure-Foundations, 10Znuny, 10serviceops-collab: upgrade/replace VRTS (formerly ORTS) buster to bullseye - https://phabricator.wikimedia.org/T295416 (10LSobanski) a:03Arnoldokoth [14:08:40] 10SRE, 10Thumbor, 10Thumbor Migration, 10serviceops, 10Platform Team Workboards (Platform Engineering Reliability): Improve Thumbor's memcached infrastructure - https://phabricator.wikimedia.org/T318695 (10jijiki) [14:08:44] (03CR) 10Klausman: [C: 03+2] admin_ng/ml-serve: add 
namespace permissions for revertrisk [puppet] - 10https://gerrit.wikimedia.org/r/919842 (https://phabricator.wikimedia.org/T333124) (owner: 10Klausman) [14:08:48] 10SRE, 10Thumbor, 10Thumbor Migration, 10serviceops, and 2 others: Final steps for fully-Kubernetes Thumbor - https://phabricator.wikimedia.org/T334488 (10jijiki) [14:08:52] 10SRE, 10Thumbor, 10Thumbor Migration, 10serviceops, and 2 others: Final steps for fully-Kubernetes Thumbor - https://phabricator.wikimedia.org/T334488 (10jijiki) [14:09:02] 10SRE, 10Infrastructure-Foundations, 10Znuny, 10serviceops-collab: upgrade/replace VRTS (formerly ORTS) buster to bullseye - https://phabricator.wikimedia.org/T295416 (10LSobanski) p:05Medium→03High [14:09:26] (03CR) 10Klausman: [C: 03+2] admin_ng: add revertrisk namespace and config to ml-serve clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/919843 (https://phabricator.wikimedia.org/T333124) (owner: 10Klausman) [14:09:28] (03PS2) 10JMeybohm: Update charts from mesh.configuration 1.1 to 1.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/919849 (https://phabricator.wikimedia.org/T300324) [14:10:29] (03PS4) 10Herron: profile::arclamp::redis: introduce/move arclamp redis config to profile [puppet] - 10https://gerrit.wikimedia.org/r/919163 (https://phabricator.wikimedia.org/T327277) [14:11:44] (03Merged) 10jenkins-bot: admin_ng: add revertrisk namespace and config to ml-serve clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/919843 (https://phabricator.wikimedia.org/T333124) (owner: 10Klausman) [14:13:01] (NodeTextfileStale) firing: Stale textfile for bast2003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [14:13:13] (03CR) 10Giuseppe Lavagetto: "LGTM once you've also bumped chart versions 😊" [deployment-charts] - 10https://gerrit.wikimedia.org/r/919849 
(https://phabricator.wikimedia.org/T300324) (owner: 10JMeybohm) [14:16:08] !log klausman@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [14:17:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:17:08] !log klausman@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [14:19:43] (03CR) 10Herron: [C: 03+2] profile::arclamp::redis: introduce/move arclamp redis config to profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/919163 (https://phabricator.wikimedia.org/T327277) (owner: 10Herron) [14:20:12] !log klausman@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [14:20:37] !log klausman@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. 
[14:21:04] !log volans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest1001.eqiad.wmnet with OS bullseye [14:22:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:22:11] (03CR) 10Volans: sre.hosts.reimage: merge reimage cookbooks (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/904510 (owner: 10Slyngshede) [14:24:35] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 14 days, 0:00:00 on wdqs2021.codfw.wmnet with reason: testing transferpy cookbook [14:24:37] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on wdqs2021.codfw.wmnet with reason: testing transferpy cookbook [14:24:48] (03PS2) 10Giuseppe Lavagetto: docker::baseimages: support bookworm [puppet] - 10https://gerrit.wikimedia.org/r/919611 (https://phabricator.wikimedia.org/T335560) [14:25:05] (03PS19) 10Slyngshede: sre.hosts.reimage: merge reimage cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/904510 [14:25:31] (03CR) 10Slyngshede: sre.hosts.reimage: merge reimage cookbooks (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/904510 (owner: 10Slyngshede) [14:26:53] (03CR) 10Volans: [C: 03+1] "LGTM!" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/904510 (owner: 10Slyngshede) [14:34:05] (03CR) 10Alexandros Kosiaris: [C: 03+1] docker::baseimages: support bookworm [puppet] - 10https://gerrit.wikimedia.org/r/919611 (https://phabricator.wikimedia.org/T335560) (owner: 10Giuseppe Lavagetto) [14:43:25] (03PS20) 10Krinkle: Profiler: Implement "Excimer UI" option for WikimediaDebug [mediawiki-config] - 10https://gerrit.wikimedia.org/r/902529 (https://phabricator.wikimedia.org/T291015) [14:44:17] (03CR) 10CI reject: [V: 04-1] Profiler: Implement "Excimer UI" option for WikimediaDebug [mediawiki-config] - 10https://gerrit.wikimedia.org/r/902529 (https://phabricator.wikimedia.org/T291015) (owner: 10Krinkle) [14:45:32] 10SRE, 10Traffic, 10serviceops, 10Platform Team Initiatives (API Gateway): Handle edge cache invalidation for the api gateway - https://phabricator.wikimedia.org/T324200 (10Joe) >>! In T324200#8850266, @Ottomata wrote: >> Basically add a configuration that allows to figure out article_url => derived urls... 
[14:46:22] (03CR) 10Giuseppe Lavagetto: [C: 03+2] docker::baseimages: support bookworm [puppet] - 10https://gerrit.wikimedia.org/r/919611 (https://phabricator.wikimedia.org/T335560) (owner: 10Giuseppe Lavagetto) [14:47:26] 10SRE, 10Thumbor, 10Thumbor Migration, 10serviceops, 10Platform Team Workboards (Platform Engineering Reliability): Improve Thumbor's memcached backend - https://phabricator.wikimedia.org/T318695 (10jijiki) [14:48:06] 10SRE, 10Thumbor, 10Thumbor Migration, 10serviceops, 10Platform Team Workboards (Platform Engineering Reliability): Improve Thumbor's memcached backend - https://phabricator.wikimedia.org/T318695 (10jijiki) [14:48:43] (03CR) 10AOkoth: [C: 03+2] site: revert vrts2001 role post re-image [puppet] - 10https://gerrit.wikimedia.org/r/918486 (https://phabricator.wikimedia.org/T323515) (owner: 10AOkoth) [15:00:07] !log aokoth@cumin1001 START - Cookbook sre.hosts.downtime for 14 days, 0:00:00 on vrts2001.codfw.wmnet with reason: Setup Incomplete [15:00:20] !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on vrts2001.codfw.wmnet with reason: Setup Incomplete [15:00:36] RECOVERY - Check systemd state on db2184 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:02:22] (03PS2) 10JMeybohm: Update charts from mesh.configuration 1.2.0 to 1.2.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/919848 (https://phabricator.wikimedia.org/T300324) [15:02:24] (03PS3) 10JMeybohm: Update charts from mesh.configuration 1.1 to 1.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/919849 (https://phabricator.wikimedia.org/T300324) [15:02:59] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10netops: Setup zero touch provisioning (ZTP) for network devices - https://phabricator.wikimedia.org/T336485 (10ayounsi) @Volans is it possible to have a full pcap of those `unknown network segment` ? 
[15:04:52] (03PS1) 10Alexandros Kosiaris: vrts1001: Switch to insetup::serviceops_collab [puppet] - 10https://gerrit.wikimedia.org/r/919856 [15:04:55] (03CR) 10Jaime Nuche: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/919834 (https://phabricator.wikimedia.org/T336168) (owner: 10Jaime Nuche) [15:05:16] (03CR) 10Jaime Nuche: [C: 04-1] "New secret needs to be added before this patch can be merged." [puppet] - 10https://gerrit.wikimedia.org/r/919834 (https://phabricator.wikimedia.org/T336168) (owner: 10Jaime Nuche) [15:30:05] jan_drewniak: I, the Bot under the Fountain, call upon thee, The Deployer, to do Wikimedia Portals Update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230515T1530). [15:30:39] * Krinkle is staging on mwdebug1002 [15:39:16] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:40:38] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.278 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:51:26] hmmm wikibugs quit 40 minutes ago [15:53:18] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH nodes) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:53:50] vgutierrez: I get the impression that something is wrong with toolforge, because https://versions.toolforge.org/ isn't working either [15:54:47] hmm kinda expected apparently after checking #wikimedia-cloud topic: "Status: system instability due to ongoing maintenance" [15:55:30] aha! [15:55:35] Thanks for finding that.
:-) [15:57:45] * Krinkle done testing on mwdebug1002 [15:58:18] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH nodes) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:30:10] PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:35:01] (NodeTextfileStale) firing: Stale textfile for testvm2005:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230515T1700) [17:00:05] ryankemper: My dear minions, it's time we take the moon! Just kidding. Time for Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230515T1700).
[17:15:11] !log volans@cumin2002 START - Cookbook sre.network.provision for device ssw1-a1-codfw.mgmt.codfw.wmnet [17:15:12] !log volans@cumin2002 START - Cookbook sre.dns.netbox [17:26:15] !log volans@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for ssw1-a1-codfw - volans@cumin2002" [17:27:20] !log volans@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for ssw1-a1-codfw - volans@cumin2002" [17:27:20] !log volans@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:27:21] !log volans@cumin2002 START - Cookbook sre.dns.netbox [17:29:21] !log volans@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove management record for ssw1-a1-codfw - volans@cumin2002" [17:30:23] !log volans@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove management record for ssw1-a1-codfw - volans@cumin2002" [17:30:23] !log volans@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:30:24] !log volans@cumin2002 END (FAIL) - Cookbook sre.network.provision (exit_code=99) for device ssw1-a1-codfw.mgmt.codfw.wmnet [17:39:11] !log volans@cumin2002 START - Cookbook sre.network.provision for device ssw1-a1-codfw.mgmt.codfw.wmnet [17:39:13] !log volans@cumin2002 START - Cookbook sre.dns.netbox [17:41:14] !log volans@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for ssw1-a1-codfw - volans@cumin2002" [17:41:26] I don't see my gerrit comments on this channel anymore. 
[17:41:55] mutante: see above, bots missing because of cloud issues [17:42:07] I don't know the current situation [17:42:15] !log volans@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for ssw1-a1-codfw - volans@cumin2002" [17:42:15] !log volans@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:42:16] volans: oh, I missed. thank you! got it [17:42:17] !log volans@cumin2002 START - Cookbook sre.dns.netbox [17:45:46] volans: PS on https://phabricator.wikimedia.org/T335586#8816922 [17:46:13] thx [17:46:24] !log volans@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove management record for ssw1-a1-codfw - volans@cumin2002" [17:47:29] !log volans@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove management record for ssw1-a1-codfw - volans@cumin2002" [17:47:30] !log volans@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:47:30] !log volans@cumin2002 END (FAIL) - Cookbook sre.network.provision (exit_code=99) for device ssw1-a1-codfw.mgmt.codfw.wmnet [18:06:18] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host dns2005.wikimedia.org with OS bullseye [18:06:44] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dns2005.wikimedia.org with OS bullseye [18:10:44] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host dns2005.wikimedia.org with OS bullseye [18:10:53] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dns2005.wikimedia.org with OS bullseye [18:11:23] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host dns2005.wikimedia.org with OS bullseye [18:11:33] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host 
dns2005.wikimedia.org with OS bullseye [18:13:01] (NodeTextfileStale) firing: Stale textfile for bast2003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [18:31:58] 10SRE, 10Infrastructure-Foundations: DHCP error while trying to run the reimaging cookbook for dns2005.wikimedia.org (install server install2004.wikimedia.org) - https://phabricator.wikimedia.org/T336696 (10ssingh) My very (basic) attempts at debugging this: `/etc/dhcp/automation/mgmt-codfw/ssw1-a1-codfw.mgmt.... [18:33:00] !log Rolling out maglev LVS scheduler in eqsin - T263797 [18:33:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:05] T263797: Switch Source Hashing ('sh') scheduling on LVS hosts to Maglev hashing ('mh') - https://phabricator.wikimedia.org/T263797 [18:38:49] 10SRE, 10Infrastructure-Foundations: DHCP error while trying to run the reimaging cookbook for dns2005.wikimedia.org (install server install2004.wikimedia.org) - https://phabricator.wikimedia.org/T336696 (10Dzahn) @ssingh How about this fix to unblock you: - ssh install2004.wikimedia.org - edit /etc/dhcp/dhcp... [18:44:35] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Adee Ritman (WMDE) - https://phabricator.wikimedia.org/T336434 (10Dzahn) has been signed - ready to go - needs patch [18:48:38] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Adee Ritman (WMDE) - https://phabricator.wikimedia.org/T336434 (10Dzahn) a:05adee_wmde→03CDanis Hi Chris, if you wanna take over clinic duty, this just needs a patch like https://gerrit.wikimedia.org/r/c/operations/puppet/+/919225. NDA was signe... 
[18:50:01] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Adee Ritman (WMDE) - https://phabricator.wikimedia.org/T336434 (10Dzahn) LDAP uid: adri I already added them to "wmf" and "nda" like their WMDE co-workers. [18:52:33] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Adee Ritman (WMDE), Robert Timm (WMDE) and Loren Johnson (WMDE) - https://phabricator.wikimedia.org/T335941 (10Dzahn) 05In progress→03Resolved @KFrancis Thank you as always. I have added Adee to the LDAP groups. There will be some follow-up to do... [18:54:04] PROBLEM - pybal on lvs5005 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [18:54:35] !log set routing-options static route 208.80.153.231/32 next-hop [ 208.80.153.48 208.80.153.10 ]: codfw row D maint 2023/05/16 [dns2002] T335042 [18:54:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:54:40] T335042: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 [18:54:42] !log LDAP - added uid 'adee' to groups wmde and nda - T336434 [18:54:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:54:46] T336434: Grant Access to ldap/wmde for Adee Ritman (WMDE) - https://phabricator.wikimedia.org/T336434 [18:54:48] PROBLEM - PyBal backends health check on lvs5005 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [18:55:04] PROBLEM - PyBal connections to etcd on lvs5005 is CRITICAL: CRITICAL: 0 connections established with conf2006.codfw.wmnet:4001 (min=4) https://wikitech.wikimedia.org/wiki/PyBal [18:55:04] ^ expected, brett is upgrading [18:55:09] ack, thanks [18:56:20] (03CR) 10Dzahn: "new PS compiles now as https://puppet-compiler.wmflabs.org/output/919834/41191/doc1003.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/919834 
(https://phabricator.wikimedia.org/T336168) (owner: 10Jaime Nuche) [19:01:34] (03CR) 10Dzahn: "I wish we wouldn't have to touch all the servers, including the prod ones, to test on new non-prod servers. .." [puppet] - 10https://gerrit.wikimedia.org/r/919834 (https://phabricator.wikimedia.org/T336168) (owner: 10Jaime Nuche) [19:02:35] !log bking@deploy1002 Started deploy [wdqs/wdqs@41174d5]: 0.3.124 [19:03:35] !log [WDQS Deploy] Tests passing following deploy of `0.3.124` on canary `wdqs1003`; proceeding to rest of fleet [19:03:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:05:16] (03CR) 10Dzahn: "will deploy it regardless.. but want lunch first and watch it closely" [puppet] - 10https://gerrit.wikimedia.org/r/919834 (https://phabricator.wikimedia.org/T336168) (owner: 10Jaime Nuche) [19:12:40] !log bking@deploy1002 Finished deploy [wdqs/wdqs@41174d5]: 0.3.124 (duration: 10m 05s) [19:12:58] !log bking@deploy1002 Started deploy [wdqs/wdqs@41174d5]: (no justification provided) [19:15:29] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc-gp2001.codfw.wmnet [19:15:51] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc-gp1001.eqiad.wmnet [19:18:44] !log bking@deploy1002 Finished deploy [wdqs/wdqs@41174d5]: (no justification provided) (duration: 05m 46s) [19:18:47] !log bking@deploy1002 Started deploy [wdqs/wdqs@41174d5]: (no justification provided) [19:18:52] !log bking@deploy1002 Finished deploy [wdqs/wdqs@41174d5]: (no justification provided) (duration: 00m 05s) [19:19:05] 10SRE, 10Infrastructure-Foundations, 10LDAP: Migrate the r/w LDAP servers to Bullseye - https://phabricator.wikimedia.org/T331699 (10taavi) > The easiest way I can think of to block all of those actions would be to temporarily change the uid=novaadmin user's password. We don't have anything else that would b... 
[19:19:48] !log bking@deploy1002 Started deploy [wdqs/wdqs@41174d5]: (no justification provided) [19:19:53] !log bking@deploy1002 Finished deploy [wdqs/wdqs@41174d5]: (no justification provided) (duration: 00m 05s) [19:21:30] (03PS6) 10Eevans: (WIP) cassandra: add support for version 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/913265 (https://phabricator.wikimedia.org/T313814) [19:22:03] (03CR) 10CI reject: [V: 04-1] (WIP) cassandra: add support for version 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/913265 (https://phabricator.wikimedia.org/T313814) (owner: 10Eevans) [19:22:53] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-gp2001.codfw.wmnet [19:23:00] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-gp1001.eqiad.wmnet [19:24:56] RECOVERY - pybal on lvs5005 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [19:25:46] RECOVERY - PyBal backends health check on lvs5005 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [19:26:14] (SystemdUnitFailed) firing: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-categories.service Failed on wdqs1010:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:26:21] !log bking@deploy1002 Started deploy [wdqs/wdqs@41174d5] (wcqs): deploy 0.3.124 to WCQS [19:27:04] PROBLEM - Check systemd state on wdqs1010 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-categories.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:28:18] RECOVERY - PyBal connections to etcd on lvs5005 is OK: OK: 4 connections established with conf2006.codfw.wmnet:4001 (min=4) https://wikitech.wikimedia.org/wiki/PyBal [19:28:25] !log 
bking@deploy1002 Finished deploy [wdqs/wdqs@41174d5] (wcqs): deploy 0.3.124 to WCQS (duration: 02m 03s) [19:28:33] (03PS7) 10Eevans: (WIP) cassandra: add support for version 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/913265 (https://phabricator.wikimedia.org/T313814) [19:29:03] (03CR) 10CI reject: [V: 04-1] (WIP) cassandra: add support for version 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/913265 (https://phabricator.wikimedia.org/T313814) (owner: 10Eevans) [19:29:27] (03PS8) 10Eevans: (WIP) cassandra: add support for version 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/913265 (https://phabricator.wikimedia.org/T313814) [19:29:57] (03CR) 10CI reject: [V: 04-1] (WIP) cassandra: add support for version 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/913265 (https://phabricator.wikimedia.org/T313814) (owner: 10Eevans) [19:32:09] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/913265 (https://phabricator.wikimedia.org/T313814) (owner: 10Eevans) [19:32:56] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc-gp2002.codfw.wmnet [19:33:03] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc-gp1002.eqiad.wmnet [19:37:14] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-gp1002.eqiad.wmnet [19:40:00] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-gp2002.codfw.wmnet [19:43:31] mutante: thanks for clinic duty, I'm taking over [19:43:57] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack, 10Discovery-Search (Current work): Create cookbook to reindex into elasticsearch / cirrus - https://phabricator.wikimedia.org/T219507 (10bking) Moving this out of the current work, but this is still a priority for us. Will revisit next quarter. 
[19:44:05] 10SRE, 10SRE-tools, 10Discovery-Search, 10Infrastructure-Foundations, 10Spicerack: Create cookbook to reindex into elasticsearch / cirrus - https://phabricator.wikimedia.org/T219507 (10bking) [19:47:01] !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 2:00:00 on 20 hosts with reason: T335042 maintenance [19:47:06] T335042: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 [19:47:16] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 2:00:00 on 20 hosts with reason: T335042 maintenance [19:47:16] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc-gp1003.eqiad.wmnet [19:49:13] !log bking@cumin1001 START - Cookbook sre.elasticsearch.ban Banning hosts: elastic[2050-2054,2060,2067-2068,2072,2084-2086] for row D switch upgrade - bking@cumin1001 - T335042 [19:49:13] !log bking@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.ban (exit_code=99) Banning hosts: elastic[2050-2054,2060,2067-2068,2072,2084-2086] for row D switch upgrade - bking@cumin1001 - T335042 [19:49:38] PROBLEM - pybal on lvs5006 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [19:50:03] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc-gp2003.codfw.wmnet [19:50:21] !log bking@cumin1001 START - Cookbook sre.elasticsearch.ban Banning hosts: elastic[2050-2054,2060,2067-2068,2072,2084-2086]* for row D switch upgrade - bking@cumin1001 - T335042 [19:50:25] !log bking@cumin1001 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: elastic[2050-2054,2060,2067-2068,2072,2084-2086]* for row D switch upgrade - bking@cumin1001 - T335042 [19:50:34] PROBLEM - PyBal backends health check on lvs5006 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [19:53:08] PROBLEM - PyBal 
connections to etcd on lvs5006 is CRITICAL: CRITICAL: 0 connections established with conf2006.codfw.wmnet:4001 (min=16) https://wikitech.wikimedia.org/wiki/PyBal [19:54:19] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-gp1003.eqiad.wmnet [19:55:37] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-gp2003.codfw.wmnet [19:55:41] (03PS1) 10Ottomata: dse-k8s-services/mediawiki-page-content-change-enrichment - fix private helmfile values [deployment-charts] - 10https://gerrit.wikimedia.org/r/919885 (https://phabricator.wikimedia.org/T336656) [19:55:52] RECOVERY - pybal on lvs5006 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [19:56:19] !log otto@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [19:56:23] !log otto@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [19:56:48] RECOVERY - PyBal backends health check on lvs5006 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [19:58:40] RECOVERY - PyBal connections to etcd on lvs5006 is OK: OK: 16 connections established with conf2006.codfw.wmnet:4001 (min=16) https://wikitech.wikimedia.org/wiki/PyBal [19:59:19] (03PS1) 10Ottomata: dse-k8s-services/mediawiki-page-content-change-enrichment - use kafka SSL [deployment-charts] - 10https://gerrit.wikimedia.org/r/919887 (https://phabricator.wikimedia.org/T331526) [19:59:34] (03CR) 10Ottomata: [V: 03+2 C: 03+2] dse-k8s-services/mediawiki-page-content-change-enrichment - use kafka SSL [deployment-charts] - 10https://gerrit.wikimedia.org/r/919887 (https://phabricator.wikimedia.org/T331526) (owner: 10Ottomata) [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: I seem to be stuck in Groundhog week. Sigh. 
Time for (yet another) UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230515T2000). [20:00:05] No Gerrit patches in the queue for this window AFAICS. [20:00:24] !log otto@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [20:00:28] !log otto@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [20:02:05] (03CR) 10Ottomata: [C: 03+2] dse-k8s-services/mediawiki-page-content-change-enrichment - fix private helmfile values [deployment-charts] - 10https://gerrit.wikimedia.org/r/919885 (https://phabricator.wikimedia.org/T336656) (owner: 10Ottomata) [20:20:01] (NodeTextfileStale) firing: (2) Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [20:20:02] 10SRE: Bridge wikimediastatus.net to Mastodon - https://phabricator.wikimedia.org/T336701 (10Legoktm) [20:20:37] (03PS1) 10Andrew Bogott: wmcs-backup: exclude non-ceph VMs from backup [puppet] - 10https://gerrit.wikimedia.org/r/919895 [20:20:39] (03PS1) 10Andrew Bogott: remove wmcs-backup-instances script, no longer used [puppet] - 10https://gerrit.wikimedia.org/r/919896 [20:25:01] (NodeTextfileStale) firing: (3) Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [20:29:14] cdanis: :) thanks! 
was trying to strike the balance there and not hand over the confusing ticket for 3 users [20:29:45] (03CR) 10Andrew Bogott: [C: 03+2] wmcs-backup: exclude non-ceph VMs from backup [puppet] - 10https://gerrit.wikimedia.org/r/919895 (owner: 10Andrew Bogott) [20:43:09] (03PS1) 10CDanis: admin: add adri as ldap-only [puppet] - 10https://gerrit.wikimedia.org/r/919903 (https://phabricator.wikimedia.org/T336434) [20:44:35] (03CR) 10CDanis: [C: 03+2] admin: add adri as ldap-only [puppet] - 10https://gerrit.wikimedia.org/r/919903 (https://phabricator.wikimedia.org/T336434) (owner: 10CDanis) [20:47:05] (03CR) 10Dzahn: [C: 03+1] admin: add adri as ldap-only [puppet] - 10https://gerrit.wikimedia.org/r/919903 (https://phabricator.wikimedia.org/T336434) (owner: 10CDanis) [20:49:01] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to ldap/wmde for Adee Ritman (WMDE) - https://phabricator.wikimedia.org/T336434 (10CDanis) 05In progress→03Resolved [20:53:36] 10SRE, 10Incident Tooling: Bridge wikimediastatus.net to Mastodon - https://phabricator.wikimedia.org/T336701 (10Peachey88) [21:00:06] Reedy, sbassett, Maryum, and manfredi: OwO what's this, a deployment window?? Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230515T2100). nyaa~ [21:05:14] 10SRE, 10Incident Tooling: Bridge wikimediastatus.net to Mastodon - https://phabricator.wikimedia.org/T336701 (10Dzahn) side question: so we already have an RSS feed of the outage notifications? If so, can you paste the URL. [21:12:13] Hey all - mstyles and I have a couple of sec patches to deploy for: T335612 and T323651 [21:13:30] sukhe: no LVS work? 
[21:17:07] sbassett: well, it's your window on the calendar and seems quiet [21:19:05] I see nothing blocking deployment in SAL either, go for it [21:19:23] no lvs work this week, thanks for checking [21:19:28] all clear from our end [21:22:29] (we block the deployment window and also put a scap lock just to be extra sure) [21:24:24] nice! [21:25:01] 10SRE, 10Incident Tooling: Bridge wikimediastatus.net to Mastodon - https://phabricator.wikimedia.org/T336701 (10Peachey88) It looks like they should be https://www.wikimediastatus.net/history.atom (atom) and https://www.wikimediastatus.net/history.rss (rss) and should now be enabled after {T305174} [21:31:59] 10SRE, 10ops-eqiad, 10DC-Ops, 10Product-Platform Operations: Q3:rack/setup/install snapshot1016 & snapshot1017 - https://phabricator.wikimedia.org/T334955 (10Jclark-ctr) [21:38:04] (03CR) 10Dzahn: [C: 03+2] doc: temporary config for docs publishing from gitlab [puppet] - 10https://gerrit.wikimedia.org/r/919834 (https://phabricator.wikimedia.org/T336168) (owner: 10Jaime Nuche) [21:40:27] (03CR) 10Dzahn: [C: 03+2] "deploying one host at a time on all 4 hosts and checking what it does exactly." 
[puppet] - 10https://gerrit.wikimedia.org/r/919834 (https://phabricator.wikimedia.org/T336168) (owner: 10Jaime Nuche) [21:44:35] (03CR) 10Dzahn: [C: 03+2] "doc2001 - noop - as designed because not active host" [puppet] - 10https://gerrit.wikimedia.org/r/919834 (https://phabricator.wikimedia.org/T336168) (owner: 10Jaime Nuche) [21:48:58] (03CR) 10Dzahn: [C: 03+2] doc: temporary config for docs publishing from gitlab (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/919834 (https://phabricator.wikimedia.org/T336168) (owner: 10Jaime Nuche) [21:50:50] (03CR) 10Dzahn: [C: 03+2] "@Jaime - this is in place on doc1003, I see the new firewall rules that allow gitab IPs to connect to tcp 873 and 1873, I see the new rsyn" [puppet] - 10https://gerrit.wikimedia.org/r/919834 (https://phabricator.wikimedia.org/T336168) (owner: 10Jaime Nuche) [21:51:17] patch deployed for T335612 [21:51:43] !log Deployed patch for T335612 [21:51:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:00:12] 10SRE, 10Incident Tooling: Bridge wikimediastatus.net to Mastodon - https://phabricator.wikimedia.org/T336701 (10Dzahn) Thank you! both work for me and appear to be valid feeds :) Should they been in planet, btw? [22:02:27] !log deployed patch for T323651 [22:02:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:04:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:04:29] 10SRE, 10Infrastructure-Foundations: Advertised RSS/Atom feeds for wikimediastatus.net don't work - https://phabricator.wikimedia.org/T305174 (10Dzahn) Should I include the feed in https://en.planet.wikimedia.org/ or is it too much for that audience? 
[22:09:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:13:01] (NodeTextfileStale) firing: Stale textfile for bast2003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [22:13:34] (03PS1) 10BryanDavis: perl532: Add libmime-lite-perl [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/919920 (https://phabricator.wikimedia.org/T320904) [22:13:44] /me searches codesearch for the code of codesearch - https://codesearch.wmcloud.org/search/?q=codesearch&files=&excludeFiles=&repos= [22:14:23] but where is the config of codesearch in there? heh [22:18:22] I have found modules/codesearch/files/hound-config but that is not it. something tells it to use "gerrit-replica.wikimedia.org" for search but I can't find where that is configured [22:19:37] would like to make a patch that replaces "gerrit-replica" with "gerrit" [22:20:11] (03PS1) 10BryanDavis: perl532: Add libphp-serialization-perl [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/919922 (https://phabricator.wikimedia.org/T323522) [22:20:15] ah, could be that it's a cloud hiera edit [22:21:45] https://gerrit.wikimedia.org/r/plugins/gitiles/labs/codesearch/+/refs/heads/master/write_config.py#145 ? [22:23:18] dancy: oh, "labs/" ! :) thank you [22:24:01] labs will be around for a while, moving all those to /cloud/ seems a hassle [22:24:17] was it in the results of codesearch? 
[22:26:56] RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:31:39] created https://gerrit.wikimedia.org/r/c/labs/codesearch/+/919925 but labs repos dont show on -operations [22:39:27] mutante: I happened to have a local checkout of that repo so I checked the git remote. [22:40:54] dancy: gotcha! thanks:) [22:43:04] (03PS5) 10BryanDavis: Make "make" available in all images [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/864828 (https://phabricator.wikimedia.org/T320343) (owner: 10Jberkel) [22:44:47] (03CR) 10BryanDavis: [C: 03+2] Make "make" available in all images [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/864828 (https://phabricator.wikimedia.org/T320343) (owner: 10Jberkel) [22:45:19] (03Merged) 10jenkins-bot: Make "make" available in all images [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/864828 (https://phabricator.wikimedia.org/T320343) (owner: 10Jberkel) [22:47:00] (03PS2) 10BryanDavis: perl532: Add libmime-lite-perl [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/919920 (https://phabricator.wikimedia.org/T320904) [22:47:06] (03PS2) 10BryanDavis: perl532: Add libphp-serialization-perl [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/919922 (https://phabricator.wikimedia.org/T323522) [22:56:04] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS6939/IPv4: Idle - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [23:02:28] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 159 probes of 698 (alerts on 90) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [23:04:45] cdanis: so, I expected to be removed from topic 4 minutes ago. 
it's always making me a bit uncomfortable to be on that when my shift actually ended [23:04:58] I suspect a relation to earlier cloud / bot outage [23:06:12] regarding the actual on-call shift.. nothing happened. [23:06:49] is it still good form to hand over to the next shift? which channel is best [23:09:14] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 86 probes of 698 (alerts on 90) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [23:09:29] well, an active ping just to say "nothing" doesn't seem that great either [23:09:56] if anyone could update the topic / restart the bot though, would appreciate that. going afk, shift just ended if you look in VO [23:26:14] (SystemdUnitFailed) firing: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-categories.service Failed on wdqs1010:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed