[00:03:21] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [00:12:45] (03PS1) 10Cathal Mooney: Change router advertisement template to set description correctly [homer/public] - 10https://gerrit.wikimedia.org/r/975101 (https://phabricator.wikimedia.org/T347191) [00:14:05] (03CR) 10Cathal Mooney: [C: 03+2] Change router advertisement template to set description correctly [homer/public] - 10https://gerrit.wikimedia.org/r/975101 (https://phabricator.wikimedia.org/T347191) (owner: 10Cathal Mooney) [00:14:40] (03Merged) 10jenkins-bot: Change router advertisement template to set description correctly [homer/public] - 10https://gerrit.wikimedia.org/r/975101 (https://phabricator.wikimedia.org/T347191) (owner: 10Cathal Mooney) [00:16:07] (03PS1) 10Cathal Mooney: Remove DHCP relay config for codfw row a/b public vlans [homer/public] - 10https://gerrit.wikimedia.org/r/975102 (https://phabricator.wikimedia.org/T347191) [00:17:06] (03CR) 10Cathal Mooney: [C: 03+2] Remove DHCP relay config for codfw row a/b public vlans [homer/public] - 10https://gerrit.wikimedia.org/r/975102 (https://phabricator.wikimedia.org/T347191) (owner: 10Cathal Mooney) [00:17:44] (03Merged) 10jenkins-bot: Remove DHCP relay config for codfw row a/b public vlans [homer/public] - 10https://gerrit.wikimedia.org/r/975102 (https://phabricator.wikimedia.org/T347191) (owner: 10Cathal Mooney) [00:18:32] (03CR) 10Krinkle: Enable $wgStatsTarget for requests to kube-mw-debug (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955015 (https://phabricator.wikimedia.org/T240685) (owner: 10Cwhite) [00:26:16] (03PS4) 10Tim Starling: Add LoginNotify cron job [puppet] - 10https://gerrit.wikimedia.org/r/965620 (https://phabricator.wikimedia.org/T346989) [00:26:53] (03PS6) 10Tim Starling: Enable LoginNotify seen subnets table 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/965663 (https://phabricator.wikimedia.org/T346989) [00:32:51] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1157'] [00:39:06] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['an-worker1157'] [00:39:08] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/974636 [00:39:14] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/974636 (owner: 10TrainBranchBot) [00:50:24] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1157.eqiad.wmnet with OS bullseye [00:50:29] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-worker1157.eqiad.wmnet with OS bullseye executed with errors: - an-worke... 
[00:55:14] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-worker1158'] [00:59:05] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/974636 (owner: 10TrainBranchBot) [01:00:56] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['an-worker1158'] [01:12:01] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1158.eqiad.wmnet with OS bullseye [01:12:06] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-worker1158.eqiad.wmnet with OS bullseye executed with errors: - an-worke... [01:14:49] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10DC-Ops, 10cloud-services-team (FY2023/2024-Q1-Q2): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (10Andrew) P53530 [01:25:11] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10DC-Ops, 10cloud-services-team (FY2023/2024-Q1-Q2): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (10Andrew) Here's the change in errors on /dev/sdj since the 31st. ` 4c4 < (1) cloudcephosd1024.eqiad.wmnet 198 Of... 
[01:58:21] (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:38:21] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:08:21] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:08:21] (HelmReleaseBadStatus) firing: Helm release kube-system/kube-state-metrics on k8s-staging@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=kube-system - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [03:24:40] (03PS1) 10MPGuy2824: Disable PageTriage's extended features on beta testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/975107 (https://phabricator.wikimedia.org/T349635) [03:59:24] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314 (T348183)', diff saved to https://phabricator.wikimedia.org/P53531 and previous config saved to /var/cache/conftool/dbconfig/20231117-035924-arnaudb.json [03:59:32] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [04:03:21] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - 
https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [04:14:31] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314', diff saved to https://phabricator.wikimedia.org/P53532 and previous config saved to /var/cache/conftool/dbconfig/20231117-041430-arnaudb.json [04:29:37] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314', diff saved to https://phabricator.wikimedia.org/P53533 and previous config saved to /var/cache/conftool/dbconfig/20231117-042937-arnaudb.json [04:44:44] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314 (T348183)', diff saved to https://phabricator.wikimedia.org/P53534 and previous config saved to /var/cache/conftool/dbconfig/20231117-044443-arnaudb.json [04:44:45] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1147.eqiad.wmnet with reason: Maintenance [04:44:48] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [04:44:58] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1147.eqiad.wmnet with reason: Maintenance [04:45:05] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1147 (T348183)', diff saved to https://phabricator.wikimedia.org/P53535 and previous config saved to /var/cache/conftool/dbconfig/20231117-044504-arnaudb.json [04:53:15] (03CR) 10Zoranzoki21: [C: 04-1] ""groups/Phabricator/Phabricator.yaml" in translatewiki.net repository has to be updated as well" [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/974717 (https://phabricator.wikimedia.org/T299694) (owner: 10Pppery) [05:27:16] (MediaWikiLatencyExceeded) firing: Average latency high: codfw parsoid GET/200: 5.838285390921219s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - 
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [05:47:16] (MediaWikiLatencyExceeded) resolved: Average latency high: codfw parsoid GET/200: 2.9822750948484793s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [05:58:21] (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:48:21] !log mabualruz@deploy2002 Backport cancelled. 
[06:48:28] (03PS1) 10Marostegui: mariadb: Move db1119 to m2 [puppet] - 10https://gerrit.wikimedia.org/r/975117 (https://phabricator.wikimedia.org/T351386) [06:49:38] (03CR) 10Marostegui: [C: 03+2] mariadb: Move db1119 to m2 [puppet] - 10https://gerrit.wikimedia.org/r/975117 (https://phabricator.wikimedia.org/T351386) (owner: 10Marostegui) [06:54:30] (03PS1) 10Marostegui: db2133: Migrate to MariaDB 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/975118 (https://phabricator.wikimedia.org/T351386) [06:55:32] (03CR) 10Marostegui: [C: 03+2] db2133: Migrate to MariaDB 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/975118 (https://phabricator.wikimedia.org/T351386) (owner: 10Marostegui) [06:55:38] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db2133.codfw.wmnet with OS bookworm [07:00:05] Deploy window MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231117T0700) [07:08:21] (HelmReleaseBadStatus) firing: Helm release kube-system/kube-state-metrics on k8s-staging@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=kube-system - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [07:12:30] PROBLEM - haproxy failover on dbproxy1025 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [07:12:37] ^ expected [07:12:58] PROBLEM - haproxy failover on dbproxy1023 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [07:13:57] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db2133.codfw.wmnet with reason: host reimage [07:16:57] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2133.codfw.wmnet with reason: host reimage [07:19:33] 
https://www.irccloud.com/pastebin/Zl0lzsiz [07:20:44] Good morning, I have some trouble backporting; details are in the snippet above [07:30:10] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2133.codfw.wmnet with OS bookworm [07:31:15] jouncebot: nowandnext [07:31:15] For the next 0 hour(s) and 28 minute(s): MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231117T0700) [07:31:15] In 0 hour(s) and 28 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231117T0800) [07:31:26] mo_abualruz: why are you backporting? [07:31:38] It's Friday. Has approval for an emergency deploy been sought? [07:34:43] !log jmm@cumin1001 START - Cookbook sre.ganeti.resource-report [07:34:44] !log jmm@cumin1001 END (PASS) - Cookbook sre.ganeti.resource-report (exit_code=0) [07:34:49] !log jmm@cumin1001 START - Cookbook sre.ganeti.resource-report [07:34:49] !log jmm@cumin1001 END (PASS) - Cookbook sre.ganeti.resource-report (exit_code=0) [07:35:16] (03PS1) 10Muehlenhoff: Adapt VM name [puppet] - 10https://gerrit.wikimedia.org/r/975122 (https://phabricator.wikimedia.org/T349402) [07:38:14] (03CR) 10Muehlenhoff: [C: 03+2] Adapt VM name [puppet] - 10https://gerrit.wikimedia.org/r/975122 (https://phabricator.wikimedia.org/T349402) (owner: 10Muehlenhoff) [07:41:54] (03PS1) 10Muehlenhoff: Switch moss nodes to role::insetup::buster [puppet] - 10https://gerrit.wikimedia.org/r/975123 (https://phabricator.wikimedia.org/T349619) [07:43:53] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [07:44:19] RhinosF1: Not sure about the workflow on Fridays; my team has asked me to backport it, as there is a high number of front-end errors because of it [07:45:11] (03PS4) 10Slyngshede: P:url_downloader add 
blackbox exporter. [puppet] - 10https://gerrit.wikimedia.org/r/973780 (https://phabricator.wikimedia.org/T350694) [07:47:11] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/532/console" [puppet] - 10https://gerrit.wikimedia.org/r/973780 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [07:47:28] (03PS1) 10Muehlenhoff: Switch debmonitor2003 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/975124 (https://phabricator.wikimedia.org/T349619) [07:49:43] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host debmonitor2003.codfw.wmnet [07:50:10] (03CR) 10Muehlenhoff: [C: 03+2] Switch debmonitor2003 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/975124 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [07:50:16] (03PS2) 10Muehlenhoff: Switch debmonitor2003 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/975124 (https://phabricator.wikimedia.org/T349619) [07:57:22] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host debmonitor2003.codfw.wmnet [08:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231117T0800) [08:00:48] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [08:01:21] mo_abualruz: you still need SRE/Releng approval [08:01:51] moritzm: you seem to be around? 
We've got a request for an emergency deploy [08:02:21] Sure, where do I submit a request? [08:02:48] mo_abualruz: you ask in here and -releng, I've done that [08:03:15] Hopefully releng can also look at the error you got [08:03:21] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [08:03:37] thanks a lot [08:04:22] I'm around, but let's rather wait for one of the releng folks to be around (hashar or jnuche), not sure on which basis those exceptions are handled [08:05:10] moritzm: I've pinged both in -releng, someone from SRE is supposed to say it's ok too I believe. We'll need them for the fact mo_abualruz couldn't work scap either. [08:05:33] mo_abualruz: it will be at least an hour for Jamie, not sure about has.har [08:05:49] !log jmm@cumin1001 START - Cookbook sre.ganeti.resource-report [08:05:50] !log jmm@cumin1001 END (PASS) - Cookbook sre.ganeti.resource-report (exit_code=0) [08:06:34] No worries, I will wait. Thanks a lot, RhinosF1 [08:06:58] 10SRE, 10Infrastructure-Foundations, 10fundraising-tech-ops, 10vm-requests, 10Patch-For-Review: eqiad: 1 VM requested for community-crm - https://phabricator.wikimedia.org/T349402 (10MoritzMuehlenhoff) p:05Triage→03Medium [08:07:31] 10SRE, 10Infrastructure-Foundations, 10fundraising-tech-ops, 10vm-requests, 10Patch-For-Review: eqiad: 1 VM requested for community-crm - https://phabricator.wikimedia.org/T349402 (10MoritzMuehlenhoff) [08:08:03] 10SRE, 10Infrastructure-Foundations, 10fundraising-tech-ops, 10vm-requests, 10Patch-For-Review: eqiad: 1 VM requested for community-crm - https://phabricator.wikimedia.org/T349402 (10MoritzMuehlenhoff) [08:09:59] !log jmm@cumin1001 START - Cookbook sre.ganeti.makevm for new host crm2001.codfw.wmnet [08:10:00] !log jmm@cumin1001 START - Cookbook sre.dns.netbox [08:13:27] !log jmm@cumin1001 START - Cookbook 
sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM crm2001.codfw.wmnet - jmm@cumin1001" [08:14:18] !log jmm@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM crm2001.codfw.wmnet - jmm@cumin1001" [08:14:18] !log jmm@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:14:18] !log jmm@cumin1001 START - Cookbook sre.dns.wipe-cache crm2001.codfw.wmnet on all recursors [08:14:21] !log jmm@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) crm2001.codfw.wmnet on all recursors [08:14:48] !log jmm@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM crm2001.codfw.wmnet - jmm@cumin1001" [08:15:39] !log jmm@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM crm2001.codfw.wmnet - jmm@cumin1001" [08:18:29] (03PS1) 10Muehlenhoff: Configure crm2001 for Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/975202 (https://phabricator.wikimedia.org/T349402) [08:19:16] PROBLEM - Check systemd state on kubernetes2055 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:23:25] (03CR) 10Muehlenhoff: [C: 03+2] Configure crm2001 for Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/975202 (https://phabricator.wikimedia.org/T349402) (owner: 10Muehlenhoff) [08:25:18] !log jmm@cumin1001 START - Cookbook sre.hosts.reimage for host crm2001.codfw.wmnet with OS bookworm [08:25:30] 10SRE, 10Infrastructure-Foundations, 10fundraising-tech-ops, 10vm-requests, 10Patch-For-Review: eqiad: 1 VM requested for community-crm - https://phabricator.wikimedia.org/T349402 (10ops-monitoring-bot) Cookbook 
cookbooks.sre.hosts.reimage was started by jmm@cumin1001 for host crm2001.codfw.wmnet with OS... [08:29:00] (03CR) 10Stevemunene: [C: 03+2] Add dummy keytabs for new druid101[0-1] [labs/private] - 10https://gerrit.wikimedia.org/r/965460 (https://phabricator.wikimedia.org/T336042) (owner: 10Stevemunene) [08:29:11] (03CR) 10Stevemunene: [V: 03+2 C: 03+2] Add dummy keytabs for new druid101[0-1] [labs/private] - 10https://gerrit.wikimedia.org/r/965460 (https://phabricator.wikimedia.org/T336042) (owner: 10Stevemunene) [08:30:13] !log jelto@cumin1001 START - Cookbook sre.gitlab.reboot-runner rolling reboot on A:gitlab-runner [08:36:32] !disable puppet on dbprov2001 for testing T351491 [08:36:32] T351491: pymysql.err.OperationalError: (2003, "Can't connect to MySQL server on 'db1164.eqiad.wmnet' ([SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate in certificate chain (_ssl.c:1123))") on backup - https://phabricator.wikimedia.org/T351491 [08:42:42] !log jmm@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on crm2001.codfw.wmnet with reason: host reimage [08:45:41] !log jmm@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on crm2001.codfw.wmnet with reason: host reimage [08:48:29] (03CR) 10WMDE-Fisch: "> 18:51:32 map-bmswiki is referenced for wgPopupsConflictingNavPopupsGadgetName, but it isn't either a wiki or a dblist" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/975021 (https://phabricator.wikimedia.org/T351314) (owner: 10WMDE-Fisch) [08:57:42] o/ [08:58:04] (03PS1) 10Ilias Sarantopoulos: ml-services: rollback xgboost/catboost models to kserve 0.10 [deployment-charts] - 10https://gerrit.wikimedia.org/r/975205 (https://phabricator.wikimedia.org/T347551) [08:58:26] mo_abualruz: sorry I only opened IRC a couple minutes ago [08:59:16] > I try scap backport 975096 I get ```fatal: cannot change to '/srv/mediawiki-staging/php-master': No such file or directory [08:59:16] > 07:16:21 backport failed: 
Command '['git', '-C', '/srv/mediawiki-staging/php-master', 'rev-list', 'origin/master', '--regexp-ignore-case', '--grep', 'Change-Id: Ie417d62484192f1b9ac270b1e619ec783da89d9d']' returned non-zero exit status 128. [08:59:24] RECOVERY - haproxy failover on dbproxy1023 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [09:00:01] that `php-master` link is for the beta cluster which runs mediawiki out of the master branches [09:00:10] RECOVERY - haproxy failover on dbproxy1025 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [09:00:11] there is ZERO reason for it to exist on production [09:01:17] !log jmm@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host crm2001.codfw.wmnet with OS bookworm [09:01:17] !log jmm@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host crm2001.codfw.wmnet [09:01:22] 10SRE, 10Infrastructure-Foundations, 10fundraising-tech-ops, 10vm-requests: eqiad: 1 VM requested for community-crm - https://phabricator.wikimedia.org/T349402 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin1001 for host crm2001.codfw.wmnet with OS bookworm completed: - crm... 
[09:01:37] my bet is something got changed in puppet/config which feeds the wrong value [09:02:09] hashar: mo_abualruz: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/MobileFrontend/+/975096/ is against the master branch, not a wmf release branch [09:02:29] that is the point of running `scap-backport` [09:04:37] no, scap backport has never created the cherry-picks automatically [09:04:58] !log imported php-memcached 3.1.5+2.2.0-5+deb11u1+wmf1+bullseye1 to component/php74 for bullseye-wikimedia [09:05:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:12:37] !log jelto@cumin1001 END (PASS) - Cookbook sre.gitlab.reboot-runner (exit_code=0) rolling reboot on A:gitlab-runner [09:16:22] RECOVERY - Check systemd state on kubernetes2055 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:18:18] (03PS1) 10Brouberol: Replace an-druid1001 by an-druid1001 in druid connection strings [puppet] - 10https://gerrit.wikimedia.org/r/975207 (https://phabricator.wikimedia.org/T332604) [09:18:50] (03PS2) 10Brouberol: Replace an-druid1001 by an-druid1002 in druid connection strings [puppet] - 10https://gerrit.wikimedia.org/r/975207 (https://phabricator.wikimedia.org/T332604) [09:22:51] !log jelto@cumin1001 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: Test Upgrade GitLab Replica gitlab1003 with new runners [09:24:03] !log jelto@cumin1001 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1003.wikimedia.org with reason: Test Upgrade GitLab Replica gitlab1003 with new runners [09:25:24] (03CR) 10Brouberol: [C: 03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/974164 (https://phabricator.wikimedia.org/T284150) (owner: 10Btullis) [09:26:56] (03CR) 10Brouberol: [C: 03+1] Set a non-default mapreduce file committer algorithm for spark [puppet] - 10https://gerrit.wikimedia.org/r/975006 (https://phabricator.wikimedia.org/T351388) (owner: 10Btullis) [09:31:14] !log klausman@cumin1001 START - Cookbook sre.puppet.migrate-host for host ml-etcd1003.eqiad.wmnet [09:32:35] (03PS2) 10Brouberol: Configure Matomo's TagManager to write to existing tmpdir [puppet] - 10https://gerrit.wikimedia.org/r/975058 (https://phabricator.wikimedia.org/T349910) (owner: 10Btullis) [09:34:18] (03CR) 10Brouberol: "I fixed the tests" [puppet] - 10https://gerrit.wikimedia.org/r/975058 (https://phabricator.wikimedia.org/T349910) (owner: 10Btullis) [09:34:20] (03CR) 10Klausman: [C: 03+2] hiera: migrate ml-etcd1003.eqiad.wmnet to Puppet v7 [puppet] - 10https://gerrit.wikimedia.org/r/975208 (https://phabricator.wikimedia.org/T349619) (owner: 10Klausman) [09:34:46] (03CR) 10Btullis: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/975207 (https://phabricator.wikimedia.org/T332604) (owner: 10Brouberol) [09:36:40] (03CR) 10Btullis: Configure Matomo's TagManager to write to existing tmpdir (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/975058 (https://phabricator.wikimedia.org/T349910) (owner: 10Btullis) [09:38:05] !log klausman@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host ml-etcd1003.eqiad.wmnet [09:39:24] (03CR) 10Brouberol: [C: 03+2] Replace an-druid1001 by an-druid1002 in druid connection strings [puppet] - 10https://gerrit.wikimedia.org/r/975207 (https://phabricator.wikimedia.org/T332604) (owner: 10Brouberol) [09:41:02] (03PS1) 10Muehlenhoff: Create a new initial role for crm hosts [puppet] - 10https://gerrit.wikimedia.org/r/975209 (https://phabricator.wikimedia.org/T349402) [09:41:31] (03CR) 10CI reject: [V: 04-1] Create a new initial role for crm hosts 
[puppet] - 10https://gerrit.wikimedia.org/r/975209 (https://phabricator.wikimedia.org/T349402) (owner: 10Muehlenhoff) [09:43:17] (03PS2) 10Muehlenhoff: Create a new initial role for crm hosts [puppet] - 10https://gerrit.wikimedia.org/r/975209 (https://phabricator.wikimedia.org/T349402) [09:44:10] (03PS1) 10Volans: remote: add RemoteHost.get_subset() method [software/spicerack] - 10https://gerrit.wikimedia.org/r/975211 [09:44:12] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147 (T348183)', diff saved to https://phabricator.wikimedia.org/P53537 and previous config saved to /var/cache/conftool/dbconfig/20231117-094412-arnaudb.json [09:44:17] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [09:45:01] (03CR) 10Volans: [C: 04-1] "I've sent a separate CR that should allow to simplify a bit this one and be a bit less hacky ;)" [software/spicerack] - 10https://gerrit.wikimedia.org/r/974995 (owner: 10Jbond) [09:50:59] (03CR) 10Klausman: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/535/con" [puppet] - 10https://gerrit.wikimedia.org/r/975210 (https://phabricator.wikimedia.org/T349619) (owner: 10Klausman) [09:51:51] !log klausman@cumin1001 START - Cookbook sre.puppet.migrate-host for host ml-etcd1002.eqiad.wmnet [09:55:30] (03PS1) 10Elukey: Remove ORES roles and configs [puppet] - 10https://gerrit.wikimedia.org/r/975213 (https://phabricator.wikimedia.org/T347278) [09:55:32] (03PS1) 10Elukey: profile::logstash: remove ORES configs [puppet] - 10https://gerrit.wikimedia.org/r/975214 (https://phabricator.wikimedia.org/T347278) [09:55:34] (03PS1) 10Elukey: Remove ORES deployment settings [puppet] - 10https://gerrit.wikimedia.org/r/975215 (https://phabricator.wikimedia.org/T347278) [09:55:36] (03PS1) 10Elukey: Remove ORES configs and clusters [puppet] - 
10https://gerrit.wikimedia.org/r/975216 (https://phabricator.wikimedia.org/T347278) [09:55:38] (03PS1) 10Elukey: profile::prometheus::ops: remove ORES Redis configs [puppet] - 10https://gerrit.wikimedia.org/r/975217 (https://phabricator.wikimedia.org/T347278) [09:55:40] (03PS1) 10Elukey: cloud: Remove ores-beta ATS settings [puppet] - 10https://gerrit.wikimedia.org/r/975218 (https://phabricator.wikimedia.org/T347278) [09:55:42] (03PS1) 10Elukey: admin: remove ores-admins group [puppet] - 10https://gerrit.wikimedia.org/r/975219 (https://phabricator.wikimedia.org/T347278) [09:55:44] (03PS1) 10Elukey: contactgroups: remove old team-scoring [puppet] - 10https://gerrit.wikimedia.org/r/975220 (https://phabricator.wikimedia.org/T347278) [09:58:21] (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:59:19] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147', diff saved to https://phabricator.wikimedia.org/P53539 and previous config saved to /var/cache/conftool/dbconfig/20231117-095918-arnaudb.json [10:01:30] (03PS1) 10Majavah: team-wmcs: restrict alerts to eqiad for now [alerts] - 10https://gerrit.wikimedia.org/r/975222 (https://phabricator.wikimedia.org/T350010) [10:03:39] (03PS1) 10JMeybohm: Normalize config/sites.yaml to be machine editable [homer/public] - 10https://gerrit.wikimedia.org/r/975224 (https://phabricator.wikimedia.org/T351074) [10:03:43] (03PS1) 10JMeybohm: Move mw appservers to kubernetes workers [homer/public] - 10https://gerrit.wikimedia.org/r/975225 (https://phabricator.wikimedia.org/T351074) [10:03:47] (03PS1) 10JMeybohm: Normalize conftool-data/node/{eqiad,codfw}.yaml to be machine editable [puppet] - 10https://gerrit.wikimedia.org/r/975227 (https://phabricator.wikimedia.org/T351074) [10:03:49] (03PS1) 10JMeybohm: 
Move mw appservers to kubernetes workers [puppet] - 10https://gerrit.wikimedia.org/r/975228 (https://phabricator.wikimedia.org/T351074) [10:03:56] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/975209 (https://phabricator.wikimedia.org/T349402) (owner: 10Muehlenhoff) [10:04:24] 10SRE, 10SRE-Access-Requests: Requesting access to deployment for sfaci - https://phabricator.wikimedia.org/T351431 (10MatthewVernon) [10:04:33] (03CR) 10Klausman: [V: 03+1 C: 03+2] hiera: migrate ml-etcd*.eqiad.wmnet to Puppet v7 [puppet] - 10https://gerrit.wikimedia.org/r/975210 (https://phabricator.wikimedia.org/T349619) (owner: 10Klausman) [10:05:40] (03CR) 10Muehlenhoff: [C: 03+2] Create a new initial role for crm hosts [puppet] - 10https://gerrit.wikimedia.org/r/975209 (https://phabricator.wikimedia.org/T349402) (owner: 10Muehlenhoff) [10:05:55] (03CR) 10Elukey: [C: 03+1] "Should we also set the repository in read-only and/or archive? No idea what is the procedure from gerrit to gitlab, if the old gerrit repo" [software/varnish/varnishkafka/testing] - 10https://gerrit.wikimedia.org/r/974289 (owner: 10BCornwall) [10:07:41] 10SRE, 10SRE-Access-Requests: Requesting access to deployment for sfaci - https://phabricator.wikimedia.org/T351431 (10MatthewVernon) @thcipriani you're the approver for the `deployment` group, can you approve (or otherwise) this request, please? 
[10:08:40] (03PS1) 10Muehlenhoff: Apply crm role to crm2001 [puppet] - 10https://gerrit.wikimedia.org/r/975229 (https://phabricator.wikimedia.org/T349402) [10:08:59] !log klausman@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host ml-etcd1002.eqiad.wmnet [10:09:44] !log klausman@cumin1001 START - Cookbook sre.puppet.migrate-host for host ml-etcd1001.eqiad.wmnet [10:10:14] 10SRE, 10SRE-Access-Requests: Requesting access to deployment for sfaci - https://phabricator.wikimedia.org/T351431 (10MatthewVernon) [10:11:12] 10SRE, 10SRE-Access-Requests: Requesting access to deployment for sfaci - https://phabricator.wikimedia.org/T351431 (10MatthewVernon) ssh pubkey confirmed OOB; this just needs group approval. [10:12:46] !log klausman@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host ml-etcd1001.eqiad.wmnet [10:12:48] !log jmm@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "new crm VM - jmm@cumin1001 - T349402" [10:12:54] T349402: eqiad: 1 VM requested for community-crm - https://phabricator.wikimedia.org/T349402 [10:13:17] (03CR) 10Kosta Harlan: [C: 03+2] ipoid: Disable the daily updates job and schedule an import [deployment-charts] - 10https://gerrit.wikimedia.org/r/975061 (https://phabricator.wikimedia.org/T351449) (owner: 10Tchanders) [10:13:52] (03CR) 10Kosta Harlan: [C: 03+2] "I'll deploy this one now." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/975061 (https://phabricator.wikimedia.org/T351449) (owner: 10Tchanders) [10:14:25] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147', diff saved to https://phabricator.wikimedia.org/P53540 and previous config saved to /var/cache/conftool/dbconfig/20231117-101425-arnaudb.json [10:14:54] (03PS1) 10Jcrespo: dbbackups: Update mysql CA for content and metadata backups [puppet] - 10https://gerrit.wikimedia.org/r/975231 (https://phabricator.wikimedia.org/T351491) [10:15:00] (03Merged) 10jenkins-bot: ipoid: Disable the daily updates job and schedule an import [deployment-charts] - 10https://gerrit.wikimedia.org/r/975061 (https://phabricator.wikimedia.org/T351449) (owner: 10Tchanders) [10:15:24] (03PS1) 10Jbond: sre.puppet.migrate-*: allow some steps to fail [cookbooks] - 10https://gerrit.wikimedia.org/r/975232 [10:15:26] (03CR) 10Tchanders: ipoid: Disable the daily updates job and schedule an import (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/975061 (https://phabricator.wikimedia.org/T351449) (owner: 10Tchanders) [10:15:56] !log kharlan@deploy2002 helmfile [eqiad] START helmfile.d/services/ipoid: apply [10:16:01] !log kharlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/ipoid: apply [10:16:13] !log kharlan@deploy2002 helmfile [eqiad] START helmfile.d/services/ipoid: apply [10:16:17] !log kharlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/ipoid: apply [10:17:51] !log jmm@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "new crm VM - jmm@cumin1001 - T349402" [10:17:55] T349402: eqiad: 1 VM requested for community-crm - https://phabricator.wikimedia.org/T349402 [10:18:14] (03CR) 10Muehlenhoff: [C: 03+2] Apply crm role to crm2001 [puppet] - 10https://gerrit.wikimedia.org/r/975229 (https://phabricator.wikimedia.org/T349402) (owner: 10Muehlenhoff) [10:18:21] (03CR) 10Jbond: [C: 03+1] "LGTM" 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/975211 (owner: 10Volans) [10:18:44] (03CR) 10Jcrespo: "The puppet side of the change for the ca update." [puppet] - 10https://gerrit.wikimedia.org/r/975231 (https://phabricator.wikimedia.org/T351491) (owner: 10Jcrespo) [10:19:12] (03CR) 10Jbond: [C: 03+1] admin: remove ores-admins group [puppet] - 10https://gerrit.wikimedia.org/r/975219 (https://phabricator.wikimedia.org/T347278) (owner: 10Elukey) [10:19:24] !log kharlan@deploy2002 helmfile [eqiad] START helmfile.d/services/ipoid: apply [10:19:27] !log kharlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/ipoid: apply [10:19:54] (03CR) 10Klausman: [V: 03+1 C: 03+2] hiera: migrate ml-serve-ctrl*.eqiad.wmnet to Puppet v7 [puppet] - 10https://gerrit.wikimedia.org/r/975233 (https://phabricator.wikimedia.org/T349619) (owner: 10Klausman) [10:19:59] (03PS3) 10Jbond: puppet: update gat_ca_server to also suport srv discovry [software/spicerack] - 10https://gerrit.wikimedia.org/r/974995 [10:20:14] (03CR) 10Arnaudb: [C: 03+1] dbbackups: Update mysql CA for content and metadata backups [puppet] - 10https://gerrit.wikimedia.org/r/975231 (https://phabricator.wikimedia.org/T351491) (owner: 10Jcrespo) [10:20:30] !log klausman@cumin1001 START - Cookbook sre.puppet.migrate-host for host ml-serve-ctrl1002.eqiad.wmnet [10:22:01] (03PS1) 10Muehlenhoff: Extend Wmflib::Team type with Fundraising Tech [puppet] - 10https://gerrit.wikimedia.org/r/975234 (https://phabricator.wikimedia.org/T349402) [10:23:14] !log klausman@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host ml-serve-ctrl1002.eqiad.wmnet [10:25:29] (03PS1) 10Punith.nyk: Switch mariadb::core to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/975235 [10:27:12] (03CR) 10CI reject: [V: 04-1] puppet: update gat_ca_server to also suport srv discovry [software/spicerack] - 10https://gerrit.wikimedia.org/r/974995 (owner: 10Jbond) [10:28:10] !log klausman@cumin1001 START - Cookbook 
sre.puppet.migrate-host for host ml-serve-ctrl1001.eqiad.wmnet [10:29:27] (03CR) 10Marostegui: [C: 03+1] dbbackups: Update mysql CA for content and metadata backups [puppet] - 10https://gerrit.wikimedia.org/r/975231 (https://phabricator.wikimedia.org/T351491) (owner: 10Jcrespo) [10:29:32] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147 (T348183)', diff saved to https://phabricator.wikimedia.org/P53541 and previous config saved to /var/cache/conftool/dbconfig/20231117-102931-arnaudb.json [10:29:33] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1148.eqiad.wmnet with reason: Maintenance [10:29:36] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [10:29:47] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1148.eqiad.wmnet with reason: Maintenance [10:29:53] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1148 (T348183)', diff saved to https://phabricator.wikimedia.org/P53542 and previous config saved to /var/cache/conftool/dbconfig/20231117-102952-arnaudb.json [10:31:24] !log klausman@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host ml-serve-ctrl1001.eqiad.wmnet [10:31:41] (03CR) 10Muehlenhoff: [C: 03+2] Extend Wmflib::Team type with Fundraising Tech [puppet] - 10https://gerrit.wikimedia.org/r/975234 (https://phabricator.wikimedia.org/T349402) (owner: 10Muehlenhoff) [10:32:25] (03CR) 10Jcrespo: "Should work ok, although cloud hiera is showing the trivial (non existent) output: https://puppet-compiler.wmflabs.org/output/975231/536/b" [puppet] - 10https://gerrit.wikimedia.org/r/975231 (https://phabricator.wikimedia.org/T351491) (owner: 10Jcrespo) [10:32:38] hashar: Thanks I will cherrypick into a patch against release branch [10:34:03] (03PS1) 10Mabualruz: Fixes AMC outreach drawer [extensions/MobileFrontend] 
(wmf/1.42.0-wmf.5) - 10https://gerrit.wikimedia.org/r/975037 (https://phabricator.wikimedia.org/T351362) [10:43:54] (03CR) 10Thiemo Kreuz (WMDE): Update the list of ReferenceTooltip gadget names (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/974984 (https://phabricator.wikimedia.org/T351314) (owner: 10WMDE-Fisch) [10:44:32] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/975219 (https://phabricator.wikimedia.org/T347278) (owner: 10Elukey) [10:48:53] (03CR) 10Klausman: [V: 03+1 C: 03+2] hiera: migrate ml-serve1*.eqiad.wmnet to Puppet v7 [puppet] - 10https://gerrit.wikimedia.org/r/975238 (https://phabricator.wikimedia.org/T349619) (owner: 10Klausman) [10:50:00] (03PS1) 10Kosta Harlan: ipoid: Disable cronjob in eqiad-specific config [deployment-charts] - 10https://gerrit.wikimedia.org/r/975240 (https://phabricator.wikimedia.org/T351449) [10:50:08] (03CR) 10Kosta Harlan: [C: 03+2] ipoid: Disable cronjob in eqiad-specific config [deployment-charts] - 10https://gerrit.wikimedia.org/r/975240 (https://phabricator.wikimedia.org/T351449) (owner: 10Kosta Harlan) [10:50:59] (03Merged) 10jenkins-bot: ipoid: Disable cronjob in eqiad-specific config [deployment-charts] - 10https://gerrit.wikimedia.org/r/975240 (https://phabricator.wikimedia.org/T351449) (owner: 10Kosta Harlan) [10:51:17] !log klausman@cumin1001 START - Cookbook sre.puppet.migrate-host for host ml-serve1008.eqiad.wmnet [10:52:19] (03CR) 10Klausman: "This should be coordinated with the data persistence team (of which I am not a member). 
You can find them on IRC (Libera) in #wikimedia-da" [puppet] - 10https://gerrit.wikimedia.org/r/975235 (owner: 10Punith.nyk) [10:52:38] (03CR) 10CI reject: [V: 04-1] Fixes AMC outreach drawer [extensions/MobileFrontend] (wmf/1.42.0-wmf.5) - 10https://gerrit.wikimedia.org/r/975037 (https://phabricator.wikimedia.org/T351362) (owner: 10Mabualruz) [10:52:43] !log kharlan@deploy2002 helmfile [eqiad] START helmfile.d/services/ipoid: apply [10:53:02] !log kharlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/ipoid: apply [10:53:28] (03CR) 10Mabualruz: "recheck" [extensions/MobileFrontend] (wmf/1.42.0-wmf.5) - 10https://gerrit.wikimedia.org/r/975037 (https://phabricator.wikimedia.org/T351362) (owner: 10Mabualruz) [10:54:22] (03CR) 10Klausman: [C: 03+1] profile::prometheus::ops: remove ORES Redis configs [puppet] - 10https://gerrit.wikimedia.org/r/975217 (https://phabricator.wikimedia.org/T347278) (owner: 10Elukey) [10:54:32] !log klausman@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host ml-serve1008.eqiad.wmnet [10:54:47] (03CR) 10Klausman: [C: 03+1] profile::logstash: remove ORES configs [puppet] - 10https://gerrit.wikimedia.org/r/975214 (https://phabricator.wikimedia.org/T347278) (owner: 10Elukey) [10:55:40] (03CR) 10Klausman: [C: 03+1] Remove ORES roles and configs [puppet] - 10https://gerrit.wikimedia.org/r/975213 (https://phabricator.wikimedia.org/T347278) (owner: 10Elukey) [10:56:02] (03PS1) 10Muehlenhoff: Create a new crm-root group and apply to crm hosts [puppet] - 10https://gerrit.wikimedia.org/r/975242 (https://phabricator.wikimedia.org/T349402) [10:56:33] (03CR) 10Klausman: [C: 03+1] Remove ORES deployment settings [puppet] - 10https://gerrit.wikimedia.org/r/975215 (https://phabricator.wikimedia.org/T347278) (owner: 10Elukey) [10:56:37] (03CR) 10CI reject: [V: 04-1] Create a new crm-root group and apply to crm hosts [puppet] - 10https://gerrit.wikimedia.org/r/975242 (https://phabricator.wikimedia.org/T349402) (owner: 
10Muehlenhoff) [10:57:12] (03CR) 10Klausman: "I presume the prod-site data (current state) will be automagically removed when this is submitted?" [puppet] - 10https://gerrit.wikimedia.org/r/975216 (https://phabricator.wikimedia.org/T347278) (owner: 10Elukey) [10:58:13] (03CR) 10Klausman: [C: 03+1] admin: remove ores-admins group [puppet] - 10https://gerrit.wikimedia.org/r/975219 (https://phabricator.wikimedia.org/T347278) (owner: 10Elukey) [10:58:28] (03CR) 10Klausman: [C: 03+1] contactgroups: remove old team-scoring [puppet] - 10https://gerrit.wikimedia.org/r/975220 (https://phabricator.wikimedia.org/T347278) (owner: 10Elukey) [10:59:55] (03PS1) 10Clément Goubert: mediawiki: Fix rsyslog rule again [deployment-charts] - 10https://gerrit.wikimedia.org/r/975246 (https://phabricator.wikimedia.org/T350430) [11:00:41] (03CR) 10Hashar: [C: 03+1] Fixes AMC outreach drawer [extensions/MobileFrontend] (wmf/1.42.0-wmf.5) - 10https://gerrit.wikimedia.org/r/975037 (https://phabricator.wikimedia.org/T351362) (owner: 10Mabualruz) [11:00:43] (03PS2) 10Muehlenhoff: Create a new crm-root group and apply to crm hosts [puppet] - 10https://gerrit.wikimedia.org/r/975242 (https://phabricator.wikimedia.org/T349402) [11:03:26] (03CR) 10Elukey: [C: 03+2] Remove ORES roles and configs [puppet] - 10https://gerrit.wikimedia.org/r/975213 (https://phabricator.wikimedia.org/T347278) (owner: 10Elukey) [11:04:18] (KubernetesCalicoDown) firing: kubestage2002.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-staging&var-instance=kubestage2002.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [11:04:24] (03CR) 10Jcrespo: "Hello, Punith, thank you for your contribution, but deploying a change that can affect TLS on MySQL production servers is something that c" [puppet] - 10https://gerrit.wikimedia.org/r/975235 (owner: 10Punith.nyk) [11:04:38] (03CR) 
10Clément Goubert: [C: 03+2] mediawiki: Fix rsyslog rule again [deployment-charts] - 10https://gerrit.wikimedia.org/r/975246 (https://phabricator.wikimedia.org/T350430) (owner: 10Clément Goubert) [11:06:16] (KubernetesRsyslogDown) firing: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [11:06:38] (03Merged) 10jenkins-bot: mediawiki: Fix rsyslog rule again [deployment-charts] - 10https://gerrit.wikimedia.org/r/975246 (https://phabricator.wikimedia.org/T350430) (owner: 10Clément Goubert) [11:07:01] !log Redeploying mw-on-k8s for T350430 [11:07:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:07] T350430: php-fpm logs from Kubernetes lack 'message' and 'normalized_message' - https://phabricator.wikimedia.org/T350430 [11:08:15] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [11:08:17] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [11:08:19] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [11:08:21] (HelmReleaseBadStatus) firing: Helm release kube-system/kube-state-metrics on k8s-staging@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=kube-system - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [11:08:22] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [11:08:23] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply [11:08:24] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [11:08:25] !log cgoubert@deploy2002 helmfile [codfw] 
START helmfile.d/services/mw-api-ext: apply [11:08:34] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [11:08:37] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [11:08:38] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [11:08:41] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [11:08:42] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply [11:08:44] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [11:08:45] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [11:08:45] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-misc: apply [11:08:45] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-wikifunctions: apply [11:10:04] 10SRE-swift-storage, 10Commons, 10MediaWiki-File-management, 10media-backups: Allow to store files between 4 and 5 GB - https://phabricator.wikimedia.org/T191804 (10jcrespo) [11:10:08] !log running schema change on backup1-eqiad (mediabackups) T191804 [11:10:10] (03Abandoned) 10Punith.nyk: Switch mariadb::core to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/975235 (owner: 10Punith.nyk) [11:10:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:12] T191804: Allow to store files between 4 and 5 GB - https://phabricator.wikimedia.org/T191804 [11:10:47] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [11:11:23] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [11:11:25] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [11:11:49] 10SRE, 10Infrastructure-Foundations, 10fundraising-tech-ops, 10vm-requests, 10Patch-For-Review: eqiad: 1 VM requested for community-crm - https://phabricator.wikimedia.org/T349402 
(10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff crm2001.codfw.wmnet has been created and configured to al... [11:11:51] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [11:11:53] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply [11:12:22] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [11:12:24] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [11:12:59] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [11:13:00] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [11:13:28] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [11:13:29] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [11:13:29] I will start deployment for 975037 as now the change is against release branch, thanks for the directions and the approval [11:13:40] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by mabualruz@deploy2002 using scap backport" [extensions/MobileFrontend] (wmf/1.42.0-wmf.5) - 10https://gerrit.wikimedia.org/r/975037 (https://phabricator.wikimedia.org/T351362) (owner: 10Mabualruz) [11:14:13] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [11:14:14] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [11:14:52] (03PS5) 10Hnowlan: rest-gateway: add device-analytics [deployment-charts] - 10https://gerrit.wikimedia.org/r/970823 [11:14:53] :) [11:14:54] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply [11:14:55] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [11:15:36] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [11:15:44] !log cgoubert@deploy2002 helmfile [codfw] [canary] START helmfile.d/services/mw-jobrunner : sync 
[11:15:45] !log cgoubert@deploy2002 helmfile [codfw] [main] START helmfile.d/services/mw-jobrunner : sync [11:15:55] !log cgoubert@deploy2002 helmfile [codfw] [canary] DONE helmfile.d/services/mw-jobrunner : sync [11:15:55] !log cgoubert@deploy2002 helmfile [codfw] [main] DONE helmfile.d/services/mw-jobrunner : sync [11:16:01] !log cgoubert@deploy2002 helmfile [eqiad] [main] START helmfile.d/services/mw-jobrunner : sync [11:16:01] !log cgoubert@deploy2002 helmfile [eqiad] [canary] START helmfile.d/services/mw-jobrunner : sync [11:16:14] !log cgoubert@deploy2002 helmfile [eqiad] [main] DONE helmfile.d/services/mw-jobrunner : sync [11:16:14] !log cgoubert@deploy2002 helmfile [eqiad] [canary] DONE helmfile.d/services/mw-jobrunner : sync [11:16:15] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-misc: apply [11:16:34] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-misc: apply [11:16:35] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-misc: apply [11:16:58] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-misc: apply [11:17:00] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-wikifunctions: apply [11:17:16] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-wikifunctions: apply [11:17:18] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-wikifunctions: apply [11:17:38] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-wikifunctions: apply [11:20:18] (03CR) 10MVernon: swift: migrate one node to envoy for TLS termination (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/974215 (https://phabricator.wikimedia.org/T317616) (owner: 10MVernon) [11:20:42] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [cookbooks] - 10https://gerrit.wikimedia.org/r/975232 (owner: 10Jbond) [11:20:45] (03PS1) 10Stevemunene: switch druid host to run data_purge job [puppet] - 10https://gerrit.wikimedia.org/r/975248 
[11:20:46] !log running schema change on backup1-codfw (mediabackups) T191804 [11:20:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:51] T191804: Allow to store files between 4 and 5 GB - https://phabricator.wikimedia.org/T191804 [11:27:59] (PuppetZeroResources) firing: Puppet has failed generate resources on kubestage2002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [11:28:21] (03Merged) 10jenkins-bot: Fixes AMC outreach drawer [extensions/MobileFrontend] (wmf/1.42.0-wmf.5) - 10https://gerrit.wikimedia.org/r/975037 (https://phabricator.wikimedia.org/T351362) (owner: 10Mabualruz) [11:28:36] !log mabualruz@deploy2002 Started scap: Backport for [[gerrit:975037|Fixes AMC outreach drawer (T351362)]] [11:28:40] T351362: Regression: AMC Outreach campaign is not showing when mobile users click desktop link - https://phabricator.wikimedia.org/T351362 [11:29:26] (03PS1) 10Muehlenhoff: Remove obsolete config [puppet] - 10https://gerrit.wikimedia.org/r/975249 [11:29:55] !log mabualruz@deploy2002 mabualruz: Backport for [[gerrit:975037|Fixes AMC outreach drawer (T351362)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [11:30:18] !log mabualruz@deploy2002 mabualruz: Continuing with sync [11:36:08] !log mabualruz@deploy2002 Finished scap: Backport for [[gerrit:975037|Fixes AMC outreach drawer (T351362)]] (duration: 07m 32s) [11:36:12] T351362: Regression: AMC Outreach campaign is not showing when mobile users click desktop link - https://phabricator.wikimedia.org/T351362 [11:36:16] (KubernetesRsyslogDown) resolved: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [11:37:12] 
Thanks a lot deployment is successful [11:39:15] (03PS1) 10Muehlenhoff: Remove Hiera setting on an-worker1111 [puppet] - 10https://gerrit.wikimedia.org/r/975251 [11:39:18] (KubernetesCalicoDown) resolved: kubestage2002.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-staging&var-instance=kubestage2002.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [11:42:18] 10SRE, 10Infrastructure-Foundations, 10netops: Support Anycast GW on EVPN switches without unique IP - https://phabricator.wikimedia.org/T350579 (10cmooney) 05Open→03Resolved Patches to support this have been merged and it's working for the codfw row A/B public vlans, closing task. [11:46:31] (03PS5) 10Btullis: Configure the analytics prometheus instance to start scraping airflow [puppet] - 10https://gerrit.wikimedia.org/r/973321 (https://phabricator.wikimedia.org/T343232) [11:47:00] (03CR) 10CI reject: [V: 04-1] Configure the analytics prometheus instance to start scraping airflow [puppet] - 10https://gerrit.wikimedia.org/r/973321 (https://phabricator.wikimedia.org/T343232) (owner: 10Btullis) [11:47:37] (03CR) 10MVernon: [C: 04-1] "Hi," [software/transferpy] - 10https://gerrit.wikimedia.org/r/974986 (owner: 10Jcrespo) [11:47:47] (03PS2) 10Muehlenhoff: Remove Hiera setting on an-worker1111 [puppet] - 10https://gerrit.wikimedia.org/r/975251 [11:48:24] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Bring codfw row A-B EVPN switches live and make them gateway for existing Vlans - https://phabricator.wikimedia.org/T347191 (10cmooney) [11:48:55] (03PS6) 10Btullis: Configure the analytics prometheus instance to start scraping airflow [puppet] - 10https://gerrit.wikimedia.org/r/973321 (https://phabricator.wikimedia.org/T343232) [11:49:41] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Bring codfw row A-B EVPN switches live and make them 
gateway for existing Vlans - https://phabricator.wikimedia.org/T347191 (10cmooney) [11:53:21] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Bring codfw row A-B EVPN switches live and make them gateway for existing Vlans - https://phabricator.wikimedia.org/T347191 (10cmooney) [11:54:05] (03PS2) 10Kosta Harlan: ipoid: Add DATADIR environment variable [deployment-charts] - 10https://gerrit.wikimedia.org/r/974939 (https://phabricator.wikimedia.org/T350500) [11:54:10] (03CR) 10Kosta Harlan: [C: 03+2] ipoid: Add DATADIR environment variable [deployment-charts] - 10https://gerrit.wikimedia.org/r/974939 (https://phabricator.wikimedia.org/T350500) (owner: 10Kosta Harlan) [11:54:20] (03PS7) 10Btullis: Configure the analytics prometheus instance to start scraping airflow [puppet] - 10https://gerrit.wikimedia.org/r/973321 (https://phabricator.wikimedia.org/T343232) [11:54:31] (03CR) 10MVernon: "What practical effect does this have?" [puppet] - 10https://gerrit.wikimedia.org/r/975123 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [11:54:38] mo_abualruz: congratulations :) [11:54:44] (03PS1) 10Muehlenhoff: Remove obsolete Hiera file [puppet] - 10https://gerrit.wikimedia.org/r/975252 [11:55:01] (03Merged) 10jenkins-bot: ipoid: Add DATADIR environment variable [deployment-charts] - 10https://gerrit.wikimedia.org/r/974939 (https://phabricator.wikimedia.org/T350500) (owner: 10Kosta Harlan) [11:55:14] (03CR) 10Jbond: [C: 03+2] sre.puppet.migrate-*: allow some steps to fail [cookbooks] - 10https://gerrit.wikimedia.org/r/975232 (owner: 10Jbond) [11:55:16] hashar thanks [11:55:24] (03PS8) 10Btullis: Configure the analytics prometheus instance to start scraping airflow [puppet] - 10https://gerrit.wikimedia.org/r/973321 (https://phabricator.wikimedia.org/T343232) [11:55:31] * hashar lunches [11:59:14] (03Merged) 10jenkins-bot: sre.puppet.migrate-*: allow some steps to fail [cookbooks] - 10https://gerrit.wikimedia.org/r/975232 (owner: 10Jbond) [11:59:40] 
(03PS4) 10Jbond: puppet: update gat_ca_server to also support srv discovery [software/spicerack] - 10https://gerrit.wikimedia.org/r/974995 (https://phabricator.wikimedia.org/T341496) [11:59:42] (03PS5) 10Hnowlan: changeprop: add config support for migration to k8s jobrunners [deployment-charts] - 10https://gerrit.wikimedia.org/r/972358 (https://phabricator.wikimedia.org/T349796) [12:01:48] (03PS2) 10Stevemunene: switch druid host to run data_purge job [puppet] - 10https://gerrit.wikimedia.org/r/975248 (https://phabricator.wikimedia.org/T336043) [12:02:21] (03CR) 10CI reject: [V: 04-1] switch druid host to run data_purge job [puppet] - 10https://gerrit.wikimedia.org/r/975248 (https://phabricator.wikimedia.org/T336043) (owner: 10Stevemunene) [12:03:08] (03PS3) 10Stevemunene: switch druid host to run data_purge job [puppet] - 10https://gerrit.wikimedia.org/r/975248 (https://phabricator.wikimedia.org/T336043) [12:03:21] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [12:04:44] 10SRE, 10serviceops: Rebuild PHP 7.4 packages for Bullseye - https://phabricator.wikimedia.org/T350767 (10MoritzMuehlenhoff) p:05Triage→03Medium [12:06:19] (03CR) 10Muehlenhoff: Switch moss nodes to role::insetup::buster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/975123 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [12:06:36] (03CR) 10CI reject: [V: 04-1] puppet: update gat_ca_server to also support srv discovery [software/spicerack] - 10https://gerrit.wikimedia.org/r/974995 (https://phabricator.wikimedia.org/T341496) (owner: 10Jbond) [12:07:11] (03CR) 10MVernon: [C: 03+1] Switch moss nodes to role::insetup::buster [puppet] - 10https://gerrit.wikimedia.org/r/975123 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [12:09:46] !log kharlan@deploy2002 helmfile [eqiad] 
START helmfile.d/services/ipoid: apply [12:09:54] (03PS5) 10Jbond: puppet: update gat_ca_server to also support srv discovery [software/spicerack] - 10https://gerrit.wikimedia.org/r/974995 (https://phabricator.wikimedia.org/T341496) [12:10:08] !log kharlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/ipoid: apply [12:10:10] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Bring codfw row A-B EVPN switches live and make them gateway for existing Vlans - https://phabricator.wikimedia.org/T347191 (10cmooney) public1-a-codfw and public1-b-codfw have gateways have been migrated to the new setup. **Problems** Unfortu... [12:10:18] (03PS1) 10Muehlenhoff: Also configure acmechief hosts for initially migrated roles [puppet] - 10https://gerrit.wikimedia.org/r/975254 [12:10:56] (03CR) 10Kosta Harlan: [C: 03+2] "Deployed" [deployment-charts] - 10https://gerrit.wikimedia.org/r/974939 (https://phabricator.wikimedia.org/T350500) (owner: 10Kosta Harlan) [12:11:08] (03CR) 10Muehlenhoff: [C: 03+2] Switch moss nodes to role::insetup::buster [puppet] - 10https://gerrit.wikimedia.org/r/975123 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [12:15:19] (03PS6) 10Jbond: puppet: update gat_ca_server to also support srv discovery [software/spicerack] - 10https://gerrit.wikimedia.org/r/974995 (https://phabricator.wikimedia.org/T341496) [12:18:58] (03PS1) 10Majavah: O:puppetserver: create role for per-project puppet server [puppet] - 10https://gerrit.wikimedia.org/r/975256 (https://phabricator.wikimedia.org/T351452) [12:19:00] (03PS1) 10Majavah: P:puppetserver::git: ensure g10k isn't ran too early [puppet] - 10https://gerrit.wikimedia.org/r/975257 [12:19:15] (03PS7) 10Jbond: puppet: update gat_ca_server to also support srv discovery [software/spicerack] - 10https://gerrit.wikimedia.org/r/974995 (https://phabricator.wikimedia.org/T341496) [12:20:30] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 2): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/546/console" [puppet] - 10https://gerrit.wikimedia.org/r/975257 (owner: 10Majavah) [12:23:48] (03PS2) 10Majavah: P:puppetserver::git: ensure g10k isn't ran too early [puppet] - 10https://gerrit.wikimedia.org/r/975257 [12:23:50] (03PS1) 10JMeybohm: k8s: Make kubelet register new nodes as unschedulable [puppet] - 10https://gerrit.wikimedia.org/r/975258 [12:24:18] (03CR) 10CI reject: [V: 04-1] P:puppetserver::git: ensure g10k isn't ran too early [puppet] - 10https://gerrit.wikimedia.org/r/975257 (owner: 10Majavah) [12:24:20] (03Abandoned) 10JMeybohm: k8s: Make kubelet register new nodes as unschedulable [puppet] - 10https://gerrit.wikimedia.org/r/974615 (owner: 10JMeybohm) [12:24:51] (03CR) 10CI reject: [V: 04-1] puppet: update gat_ca_server to also support srv discovery [software/spicerack] - 10https://gerrit.wikimedia.org/r/974995 (https://phabricator.wikimedia.org/T341496) (owner: 10Jbond) [12:25:02] (03PS8) 10Jbond: puppet: update gat_ca_server to also support srv discovery [software/spicerack] - 10https://gerrit.wikimedia.org/r/974995 (https://phabricator.wikimedia.org/T341496) [12:25:18] (03PS3) 10Majavah: P:puppetserver::git: ensure g10k isn't ran too early [puppet] - 10https://gerrit.wikimedia.org/r/975257 [12:25:49] (03CR) 10CI reject: [V: 04-1] P:puppetserver::git: ensure g10k isn't ran too early [puppet] - 10https://gerrit.wikimedia.org/r/975257 (owner: 10Majavah) [12:27:17] (03PS4) 10Majavah: P:puppetserver::git: ensure g10k isn't ran too early [puppet] - 10https://gerrit.wikimedia.org/r/975257 [12:27:44] (03CR) 10JMeybohm: "I've tested this in staging-codfw. The node is created with the proper taint and unschedulable: true flag. 
Both of which are not reset on " [puppet] - 10https://gerrit.wikimedia.org/r/975258 (owner: 10JMeybohm) [12:27:49] (03CR) 10CI reject: [V: 04-1] P:puppetserver::git: ensure g10k isn't ran too early [puppet] - 10https://gerrit.wikimedia.org/r/975257 (owner: 10Majavah) [12:29:03] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/550/con" [puppet] - 10https://gerrit.wikimedia.org/r/975258 (owner: 10JMeybohm) [12:30:23] (03PS5) 10Majavah: P:puppetserver::git: ensure g10k isn't ran too early [puppet] - 10https://gerrit.wikimedia.org/r/975257 [12:31:45] (03CR) 10CI reject: [V: 04-1] puppet: update gat_ca_server to also support srv discovery [software/spicerack] - 10https://gerrit.wikimedia.org/r/974995 (https://phabricator.wikimedia.org/T341496) (owner: 10Jbond) [12:34:16] (03CR) 10JMeybohm: [C: 03+1] Remove obsolete config [puppet] - 10https://gerrit.wikimedia.org/r/975249 (owner: 10Muehlenhoff) [12:34:43] (03PS1) 10Muehlenhoff: Switch ldap-rw1001/2001 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/975259 (https://phabricator.wikimedia.org/T349619) [12:35:03] (03CR) 10Muehlenhoff: [C: 03+2] Remove obsolete config [puppet] - 10https://gerrit.wikimedia.org/r/975249 (owner: 10Muehlenhoff) [12:39:52] (03PS9) 10D3r1ck01: mc: Make it possible to use mcrouter server set by environment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/973838 (https://phabricator.wikimedia.org/T346690) [12:40:15] (03CR) 10D3r1ck01: mc: Make it possible to use mcrouter server set by environment (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/973838 (https://phabricator.wikimedia.org/T346690) (owner: 10D3r1ck01) [12:42:47] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host ldap-rw1001.wikimedia.org [12:45:45] (03CR) 10Muehlenhoff: [C: 03+2] Switch ldap-rw1001/2001 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/975259 
(https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [12:45:54] (03PS2) 10Muehlenhoff: Switch ldap-rw1001/2001 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/975259 (https://phabricator.wikimedia.org/T349619) [12:45:59] (PuppetFailure) firing: Puppet has failed on cumin1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [12:47:41] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM thanks!" [homer/public] - 10https://gerrit.wikimedia.org/r/975225 (https://phabricator.wikimedia.org/T351074) (owner: 10JMeybohm) [12:47:59] (PuppetFailure) firing: Puppet has failed on deploy1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [12:49:46] (03CR) 10Brouberol: [C: 03+1] switch druid host to run data_purge job [puppet] - 10https://gerrit.wikimedia.org/r/975248 (https://phabricator.wikimedia.org/T336043) (owner: 10Stevemunene) [12:50:31] (03CR) 10Cathal Mooney: [C: 03+1] "Yep anything that helps!" 
[homer/public] - 10https://gerrit.wikimedia.org/r/975224 (https://phabricator.wikimedia.org/T351074) (owner: 10JMeybohm) [12:50:51] !log joal@deploy2002 Started deploy [airflow-dags/analytics@a5e5ddc]: Airflow HOTFIX [airflow-dags/analytics@a5e5ddca] [12:50:59] (PuppetFailure) firing: (2) Puppet has failed on cumin1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [12:51:19] !log joal@deploy2002 Finished deploy [airflow-dags/analytics@a5e5ddc]: Airflow HOTFIX [airflow-dags/analytics@a5e5ddca] (duration: 00m 28s) [12:52:18] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host ldap-rw1001.wikimedia.org [12:52:44] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host ldap-rw2001.wikimedia.org [12:52:59] (PuppetFailure) firing: (2) Puppet has failed on deploy1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [12:52:59] (PuppetZeroResources) resolved: Puppet has failed generate resources on kubestage2002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [12:53:15] !log jmm@cumin2002 END (FAIL) - Cookbook sre.puppet.migrate-host (exit_code=99) for host ldap-rw2001.wikimedia.org [12:53:39] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host ldap-rw2001.wikimedia.org [12:54:18] elukey: puppet on deploy nodes is failing with /Stage[main]/Profile::Httpbb/Httpbb::Test_suite[ores/test_ores.yaml]/File[/srv/deployment/httpbb-tests/ores/test_ores.yaml] Could not evaluate: Could not retrieve information from environment production source(s) puppet:///modules/profile/httpbb/ores/test_ores.yaml [12:54:20] !log jmm@cumin2002 END (FAIL) - 
Cookbook sre.puppet.migrate-host (exit_code=99) for host ldap-rw2001.wikimedia.org [12:55:24] (03PS1) 10Muehlenhoff: Temporarily revert change for ldap-rw2001 [puppet] - 10https://gerrit.wikimedia.org/r/975262 [12:56:00] (03CR) 10Majavah: [C: 04-1] "the directory should be created by g10k and owned by root" [puppet] - 10https://gerrit.wikimedia.org/r/975089 (https://phabricator.wikimedia.org/T351468) (owner: 10Andrew Bogott) [12:58:04] (03CR) 10Muehlenhoff: [C: 03+2] Temporarily revert change for ldap-rw2001 [puppet] - 10https://gerrit.wikimedia.org/r/975262 (owner: 10Muehlenhoff) [13:03:29] (03CR) 10Jbond: "I have tested this using the script at https://phabricator.wikimedia.org/P53543 and it produced the following results (i manually updated t" [software/spicerack] - 10https://gerrit.wikimedia.org/r/974995 (https://phabricator.wikimedia.org/T341496) (owner: 10Jbond) [13:06:05] (03CR) 10Jbond: puppet: update gat_ca_server to also support srv discovery (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/974995 (https://phabricator.wikimedia.org/T341496) (owner: 10Jbond) [13:07:18] (03PS4) 10D3r1ck01: wmf-config: Remove StatsCacheType (unused) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/974508 (https://phabricator.wikimedia.org/T336004) [13:08:46] (03PS1) 10Jbond: Revert "Temporarily revert change for ldap-rw2001" [puppet] - 10https://gerrit.wikimedia.org/r/975038 [13:08:55] (03CR) 10Jbond: [V: 03+2 C: 03+2] Revert "Temporarily revert change for ldap-rw2001" [puppet] - 10https://gerrit.wikimedia.org/r/975038 (owner: 10Jbond) [13:11:14] RECOVERY - Check systemd state on maps1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:12:58] (03CR) 10Filippo Giunchedi: [C: 03+1] team-wmcs: restrict alerts to eqiad for now [alerts] - 10https://gerrit.wikimedia.org/r/975222 (https://phabricator.wikimedia.org/T350010) (owner: 10Majavah) [13:13:37] (03CR) 10Filippo Giunchedi: 
[C: 03+1] profile::prometheus::ops: remove ORES Redis configs [puppet] - 10https://gerrit.wikimedia.org/r/975217 (https://phabricator.wikimedia.org/T347278) (owner: 10Elukey) [13:13:45] (03CR) 10Majavah: [C: 03+2] team-wmcs: restrict alerts to eqiad for now [alerts] - 10https://gerrit.wikimedia.org/r/975222 (https://phabricator.wikimedia.org/T350010) (owner: 10Majavah) [13:13:55] (03CR) 10Filippo Giunchedi: [C: 03+1] profile::logstash: remove ORES configs [puppet] - 10https://gerrit.wikimedia.org/r/975214 (https://phabricator.wikimedia.org/T347278) (owner: 10Elukey) [13:15:02] (03Merged) 10jenkins-bot: team-wmcs: restrict alerts to eqiad for now [alerts] - 10https://gerrit.wikimedia.org/r/975222 (https://phabricator.wikimedia.org/T350010) (owner: 10Majavah) [13:16:01] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/975254 (owner: 10Muehlenhoff) [13:16:08] (03PS1) 10Muehlenhoff: Cleanup obsolete Hiera files [puppet] - 10https://gerrit.wikimedia.org/r/975265 [13:18:24] (03CR) 10Vgutierrez: "This change is ready for review." 
[puppet] - 10https://gerrit.wikimedia.org/r/975253 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [13:20:06] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/975256 (https://phabricator.wikimedia.org/T351452) (owner: 10Majavah) [13:20:32] (03CR) 10Filippo Giunchedi: "LGTM, cosmetic comments inline" [puppet] - 10https://gerrit.wikimedia.org/r/973321 (https://phabricator.wikimedia.org/T343232) (owner: 10Btullis) [13:22:51] (03PS2) 10Elukey: profile::logstash: remove ORES configs [puppet] - 10https://gerrit.wikimedia.org/r/975214 (https://phabricator.wikimedia.org/T347278) [13:22:53] (03PS2) 10Elukey: Remove ORES deployment settings [puppet] - 10https://gerrit.wikimedia.org/r/975215 (https://phabricator.wikimedia.org/T347278) [13:22:55] (03PS2) 10Elukey: Remove ORES configs and clusters [puppet] - 10https://gerrit.wikimedia.org/r/975216 (https://phabricator.wikimedia.org/T347278) [13:22:57] (03PS2) 10Elukey: profile::prometheus::ops: remove ORES Redis configs [puppet] - 10https://gerrit.wikimedia.org/r/975217 (https://phabricator.wikimedia.org/T347278) [13:22:59] (03PS2) 10Elukey: cloud: Remove ores-beta ATS settings [puppet] - 10https://gerrit.wikimedia.org/r/975218 (https://phabricator.wikimedia.org/T347278) [13:23:01] (03PS2) 10Elukey: admin: remove ores-admins group [puppet] - 10https://gerrit.wikimedia.org/r/975219 (https://phabricator.wikimedia.org/T347278) [13:23:03] (03PS2) 10Elukey: contactgroups: remove old team-scoring [puppet] - 10https://gerrit.wikimedia.org/r/975220 (https://phabricator.wikimedia.org/T347278) [13:23:05] (03PS1) 10Elukey: profile::httpbb: remove ores_test configs [puppet] - 10https://gerrit.wikimedia.org/r/975267 (https://phabricator.wikimedia.org/T347278) [13:23:33] (03CR) 10Klausman: [C: 03+1] profile::httpbb: remove ores_test configs [puppet] - 10https://gerrit.wikimedia.org/r/975267 (https://phabricator.wikimedia.org/T347278) (owner: 10Elukey) [13:24:32] !log klausman@cumin1001 START - 
Cookbook sre.puppet.migrate-host for host ml-serve1007.eqiad.wmnet [13:26:12] (03CR) 10Elukey: [C: 03+1] k8s: Make kubelet register new nodes as unschedulable [puppet] - 10https://gerrit.wikimedia.org/r/975258 (owner: 10JMeybohm) [13:26:51] (03PS1) 10Vgutierrez: wmflib::service: Add ipip_encapsulation flag on lvs [puppet] - 10https://gerrit.wikimedia.org/r/975268 (https://phabricator.wikimedia.org/T351069) [13:26:53] (03Abandoned) 10Btullis: Set a non-default mapreduce file committer algorithm for spark [puppet] - 10https://gerrit.wikimedia.org/r/975006 (https://phabricator.wikimedia.org/T351388) (owner: 10Btullis) [13:26:56] !log klausman@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host ml-serve1007.eqiad.wmnet [13:27:44] (03Abandoned) 10Vgutierrez: wmflib::service: Add ipip_encapsulation flag on lvs [puppet] - 10https://gerrit.wikimedia.org/r/975268 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [13:28:29] !log klausman@cumin1001 START - Cookbook sre.puppet.migrate-host for host ml-serve1006.eqiad.wmnet [13:28:49] (03CR) 10Vgutierrez: "This change is ready for review." 
[puppet] - 10https://gerrit.wikimedia.org/r/974623 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [13:29:19] (03CR) 10Elukey: [C: 03+2] profile::httpbb: remove ores_test configs [puppet] - 10https://gerrit.wikimedia.org/r/975267 (https://phabricator.wikimedia.org/T347278) (owner: 10Elukey) [13:30:16] (03CR) 10Vgutierrez: "PS2 PCC https://puppet-compiler.wmflabs.org/output/974623/491/ shows a working example for ncredir6001 after setting ipip_encapsulation: t" [puppet] - 10https://gerrit.wikimedia.org/r/974623 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [13:30:18] (03CR) 10Elukey: [C: 03+2] profile::logstash: remove ORES configs [puppet] - 10https://gerrit.wikimedia.org/r/975214 (https://phabricator.wikimedia.org/T347278) (owner: 10Elukey) [13:30:58] !log klausman@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host ml-serve1006.eqiad.wmnet [13:31:31] (03CR) 10Elukey: [C: 03+2] Remove ORES deployment settings [puppet] - 10https://gerrit.wikimedia.org/r/975215 (https://phabricator.wikimedia.org/T347278) (owner: 10Elukey) [13:32:48] !log klausman@cumin1001 START - Cookbook sre.puppet.migrate-host for host ml-serve1005.eqiad.wmnet [13:33:43] (03CR) 10AikoChou: [C: 03+1] ml-services: rollback xgboost/catboost models to kserve 0.10 [deployment-charts] - 10https://gerrit.wikimedia.org/r/975205 (https://phabricator.wikimedia.org/T347551) (owner: 10Ilias Sarantopoulos) [13:33:49] !log klausman@cumin1001 START - Cookbook sre.puppet.migrate-host for host ml-serve1004.eqiad.wmnet [13:35:05] (03CR) 10Elukey: [C: 03+2] Remove ORES configs and clusters [puppet] - 10https://gerrit.wikimedia.org/r/975216 (https://phabricator.wikimedia.org/T347278) (owner: 10Elukey) [13:35:31] (03PS9) 10Btullis: Configure the analytics prometheus instance to start scraping airflow [puppet] - 10https://gerrit.wikimedia.org/r/973321 (https://phabricator.wikimedia.org/T343232) [13:35:36] !log klausman@cumin1001 END (PASS) - Cookbook 
sre.puppet.migrate-host (exit_code=0) for host ml-serve1005.eqiad.wmnet [13:35:36] (03CR) 10Btullis: Configure the analytics prometheus instance to start scraping airflow (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/973321 (https://phabricator.wikimedia.org/T343232) (owner: 10Btullis) [13:35:59] (PuppetFailure) firing: (2) Puppet has failed on cumin1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [13:35:59] (03CR) 10Kevin Bazira: [C: 03+1] ml-services: rollback xgboost/catboost models to kserve 0.10 [deployment-charts] - 10https://gerrit.wikimedia.org/r/975205 (https://phabricator.wikimedia.org/T347551) (owner: 10Ilias Sarantopoulos) [13:36:06] (03PS1) 10Kosta Harlan: [betalabs] ReportIncident: Relax rate limiting for reportincident action [mediawiki-config] - 10https://gerrit.wikimedia.org/r/975270 (https://phabricator.wikimedia.org/T351299) [13:36:25] (03CR) 10Elukey: [C: 03+2] profile::prometheus::ops: remove ORES Redis configs [puppet] - 10https://gerrit.wikimedia.org/r/975217 (https://phabricator.wikimedia.org/T347278) (owner: 10Elukey) [13:36:26] !log klausman@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host ml-serve1004.eqiad.wmnet [13:37:59] (PuppetFailure) firing: (2) Puppet has failed on deploy1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [13:39:45] (03CR) 10Majavah: O:puppetserver: create role for per-project puppet server (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/975256 (https://phabricator.wikimedia.org/T351452) (owner: 10Majavah) [13:39:48] (03CR) 10Majavah: [C: 03+2] O:puppetserver: create role for per-project puppet server [puppet] - 10https://gerrit.wikimedia.org/r/975256 (https://phabricator.wikimedia.org/T351452) (owner: 10Majavah) 
[13:41:25] (03PS1) 10Jbond: puppetserver::g10k: Ensure the control repo exists before we run g10k [puppet] - 10https://gerrit.wikimedia.org/r/975272 [13:42:24] (03PS3) 10Elukey: admin: remove ores-admins group [puppet] - 10https://gerrit.wikimedia.org/r/975219 (https://phabricator.wikimedia.org/T347278) [13:42:26] (03PS3) 10Elukey: contactgroups: remove old team-scoring [puppet] - 10https://gerrit.wikimedia.org/r/975220 (https://phabricator.wikimedia.org/T347278) [13:42:28] (03PS3) 10Elukey: cloud: Remove ores-beta ATS settings [puppet] - 10https://gerrit.wikimedia.org/r/975218 (https://phabricator.wikimedia.org/T347278) [13:42:46] !log imported php-luasandbox 4.0.2-3+wmf2+bullseye1 to component/php74 for bullseye-wikimedia [13:42:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:20] (03CR) 10Jbond: [C: 04-1] "thanks for the patch but im not sure this is the best approach. i sent another fix in" [puppet] - 10https://gerrit.wikimedia.org/r/975257 (owner: 10Majavah) [13:44:00] !log klausman@cumin1001 START - Cookbook sre.puppet.migrate-host for host ml-serve1003.eqiad.wmnet [13:44:04] !log klausman@cumin1001 START - Cookbook sre.puppet.migrate-host for host ml-serve1002.eqiad.wmnet [13:44:06] 10SRE, 10serviceops: Rebuild PHP 7.4 packages for Bullseye - https://phabricator.wikimedia.org/T350767 (10MoritzMuehlenhoff) [13:44:13] (03CR) 10Elukey: [C: 03+2] admin: remove ores-admins group [puppet] - 10https://gerrit.wikimedia.org/r/975219 (https://phabricator.wikimedia.org/T347278) (owner: 10Elukey) [13:44:19] (03CR) 10CI reject: [V: 04-1] puppetserver::g10k: Ensure the control repo exists before we run g10k [puppet] - 10https://gerrit.wikimedia.org/r/975272 (owner: 10Jbond) [13:44:20] !log klausman@cumin1001 END (FAIL) - Cookbook sre.puppet.migrate-host (exit_code=99) for host ml-serve1003.eqiad.wmnet [13:44:21] (03CR) 10Elukey: [C: 03+2] contactgroups: remove old team-scoring [puppet] - 10https://gerrit.wikimedia.org/r/975220 
(https://phabricator.wikimedia.org/T347278) (owner: 10Elukey) [13:45:35] !log klausman@cumin1001 START - Cookbook sre.puppet.migrate-host for host ml-serve1003.eqiad.wmnet [13:45:36] !log klausman@cumin1001 END (ERROR) - Cookbook sre.puppet.migrate-host (exit_code=97) for host ml-serve1003.eqiad.wmnet [13:45:50] (03PS7) 10Vgutierrez: pybal,wmflib::service: Add ipip_encapsulation flag on lvs [puppet] - 10https://gerrit.wikimedia.org/r/974623 (https://phabricator.wikimedia.org/T351069) [13:45:50] !log klausman@cumin1001 START - Cookbook sre.puppet.migrate-host for host ml-serve1003.eqiad.wmnet [13:46:02] argh, ^C in wrong window... [13:46:22] (03PS1) 10Vgutierrez: service: Add ipip_encapsulation field to ServiceLVS [software/spicerack] - 10https://gerrit.wikimedia.org/r/975273 (https://phabricator.wikimedia.org/T351069) [13:46:42] !log klausman@cumin1001 END (FAIL) - Cookbook sre.puppet.migrate-host (exit_code=99) for host ml-serve1003.eqiad.wmnet [13:47:04] !log klausman@cumin1001 START - Cookbook sre.puppet.migrate-host for host ml-serve1001.eqiad.wmnet [13:47:21] !log klausman@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host ml-serve1002.eqiad.wmnet [13:47:24] !log klausman@cumin1001 END (FAIL) - Cookbook sre.puppet.migrate-host (exit_code=99) for host ml-serve1001.eqiad.wmnet [13:48:08] !log reenable puppet on dbprov2001 T351491 [13:48:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:13] T351491: pymysql.err.OperationalError: (2003, "Can't connect to MySQL server on 'db1164.eqiad.wmnet' ([SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate in certificate chain (_ssl.c:1123))") on backup - https://phabricator.wikimedia.org/T351491 [13:49:14] (03CR) 10Elukey: [C: 03+2] cloud: Remove ores-beta ATS settings [puppet] - 10https://gerrit.wikimedia.org/r/975218 (https://phabricator.wikimedia.org/T347278) (owner: 10Elukey) [13:51:37] (03CR) 10Klausman: [C: 03+2] hiera: Temp 
rollback of Puppet v7 migration bits for ml-serve1001 [puppet] - 10https://gerrit.wikimedia.org/r/975275 (owner: 10Klausman) [13:52:56] PROBLEM - Check systemd state on kubernetes1007 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:53:03] (03CR) 10CI reject: [V: 04-1] service: Add ipip_encapsulation field to ServiceLVS [software/spicerack] - 10https://gerrit.wikimedia.org/r/975273 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [13:54:48] (03PS1) 10Klausman: Revert "hiera: migrate ml-serve1*.eqiad.wmnet to Puppet v7" [puppet] - 10https://gerrit.wikimedia.org/r/975039 [13:55:22] (03CR) 10Klausman: [C: 03+2] Revert "hiera: migrate ml-serve1*.eqiad.wmnet to Puppet v7" [puppet] - 10https://gerrit.wikimedia.org/r/975039 (owner: 10Klausman) [13:55:59] (03PS1) 10Majavah: puppetserver: make rsync config more flexible [puppet] - 10https://gerrit.wikimedia.org/r/975277 [13:56:12] (03CR) 10Filippo Giunchedi: [C: 03+1] Configure the analytics prometheus instance to start scraping airflow [puppet] - 10https://gerrit.wikimedia.org/r/973321 (https://phabricator.wikimedia.org/T343232) (owner: 10Btullis) [13:56:27] (03CR) 10CI reject: [V: 04-1] puppetserver: make rsync config more flexible [puppet] - 10https://gerrit.wikimedia.org/r/975277 (owner: 10Majavah) [13:57:51] (03PS2) 10Majavah: puppetserver: make rsync config more flexible [puppet] - 10https://gerrit.wikimedia.org/r/975277 [13:57:59] (PuppetFailure) resolved: Puppet has failed on deploy1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [13:58:06] PROBLEM - Check systemd state on an-worker1131 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:58:21] 
(ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:59:16] (03PS2) 10Jbond: puppetserver::g10k: Ensure the control repo exists before we run g10k [puppet] - 10https://gerrit.wikimedia.org/r/975272 [13:59:48] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/552/con" [puppet] - 10https://gerrit.wikimedia.org/r/975277 (owner: 10Majavah) [14:00:43] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/975265 (owner: 10Muehlenhoff) [14:00:49] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/553/con" [puppet] - 10https://gerrit.wikimedia.org/r/975272 (owner: 10Jbond) [14:00:51] (03Abandoned) 10Majavah: P:puppetserver::git: ensure g10k isn't ran too early [puppet] - 10https://gerrit.wikimedia.org/r/975257 (owner: 10Majavah) [14:00:56] (03CR) 10Jcrespo: [C: 03+2] dbbackups: Update mysql CA for content and metadata backups [puppet] - 10https://gerrit.wikimedia.org/r/975231 (https://phabricator.wikimedia.org/T351491) (owner: 10Jcrespo) [14:00:59] (PuppetFailure) resolved: Puppet has failed on cumin2002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [14:02:46] (03CR) 10Jbond: [V: 03+1] "ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/975272 (owner: 10Jbond) [14:05:07] (03CR) 10Jbond: [C: 03+1] "lgtm thanks" [puppet] - 10https://gerrit.wikimedia.org/r/975277 (owner: 10Majavah) [14:06:51] (03CR) 10Jbond: service: Add ipip_encapsulation field to ServiceLVS (031 comment) [software/spicerack] - 
10https://gerrit.wikimedia.org/r/975273 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [14:08:07] PROBLEM - Check systemd state on an-worker1100 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:09:34] (03CR) 10Muehlenhoff: [C: 03+2] Cleanup obsolete Hiera files [puppet] - 10https://gerrit.wikimedia.org/r/975265 (owner: 10Muehlenhoff) [14:11:11] (03CR) 10Btullis: [C: 03+2] Configure Matomo's TagManager to write to existing tmpdir [puppet] - 10https://gerrit.wikimedia.org/r/975058 (https://phabricator.wikimedia.org/T349910) (owner: 10Btullis) [14:11:21] (03CR) 10Majavah: [V: 03+1 C: 03+2] puppetserver: make rsync config more flexible [puppet] - 10https://gerrit.wikimedia.org/r/975277 (owner: 10Majavah) [14:13:25] (03CR) 10Hnowlan: "This change is probably superseded by https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/973362 as there's a slightly more in" [deployment-charts] - 10https://gerrit.wikimedia.org/r/954248 (https://phabricator.wikimedia.org/T329049) (owner: 10Mvolz) [14:13:59] (03PS1) 10Arnaudb: mariadb: prepare copy of db1142 to db1242 [puppet] - 10https://gerrit.wikimedia.org/r/974642 (https://phabricator.wikimedia.org/T344036) [14:15:33] (03PS1) 10Btullis: Fix the location of the matomo config override file [puppet] - 10https://gerrit.wikimedia.org/r/975283 [14:16:20] (03PS1) 10Elukey: Revert "Revert "hiera: migrate ml-serve1*.eqiad.wmnet to Puppet v7"" [puppet] - 10https://gerrit.wikimedia.org/r/975040 [14:16:38] (03CR) 10Btullis: [C: 03+2] Fix the location of the matomo config override file [puppet] - 10https://gerrit.wikimedia.org/r/975283 (owner: 10Btullis) [14:17:11] PROBLEM - Check systemd state on puppetserver1003 is CRITICAL: CRITICAL - degraded: The following units failed: sync-puppet-volatile.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:17:14] (03CR) 10Klausman: [C: 
03+1] Revert "Revert "hiera: migrate ml-serve1*.eqiad.wmnet to Puppet v7"" [puppet] - 10https://gerrit.wikimedia.org/r/975040 (owner: 10Elukey) [14:17:19] (03CR) 10Elukey: [C: 03+2] Revert "Revert "hiera: migrate ml-serve1*.eqiad.wmnet to Puppet v7"" [puppet] - 10https://gerrit.wikimedia.org/r/975040 (owner: 10Elukey) [14:17:59] PROBLEM - Check systemd state on puppetserver2002 is CRITICAL: CRITICAL - degraded: The following units failed: sync-puppet-volatile.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:18:09] (03PS1) 10Klausman: Revert "hiera: Temp rollback of Puppet v7 migration bits for ml-serve1001" [puppet] - 10https://gerrit.wikimedia.org/r/975041 [14:18:20] (03PS2) 10Vgutierrez: service: Add ipip_encapsulation field to ServiceLVS [software/spicerack] - 10https://gerrit.wikimedia.org/r/975273 (https://phabricator.wikimedia.org/T351069) [14:18:38] (03CR) 10Vgutierrez: service: Add ipip_encapsulation field to ServiceLVS (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/975273 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [14:18:55] (03CR) 10Elukey: [C: 03+1] Revert "hiera: Temp rollback of Puppet v7 migration bits for ml-serve1001" [puppet] - 10https://gerrit.wikimedia.org/r/975041 (owner: 10Klausman) [14:19:19] (03CR) 10Klausman: [C: 03+2] Revert "hiera: Temp rollback of Puppet v7 migration bits for ml-serve1001" [puppet] - 10https://gerrit.wikimedia.org/r/975041 (owner: 10Klausman) [14:20:00] (03CR) 10Marostegui: mariadb: prepare copy of db1142 to db1242 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/974642 (https://phabricator.wikimedia.org/T344036) (owner: 10Arnaudb) [14:20:24] !log elukey@cumin1001 START - Cookbook sre.puppet.migrate-host for host ml-serve1001.eqiad.wmnet [14:20:44] !log elukey@cumin1001 END (FAIL) - Cookbook sre.puppet.migrate-host (exit_code=99) for host ml-serve1001.eqiad.wmnet [14:22:03] (03PS2) 10Arnaudb: mariadb: prepare copy of db1142 to db1242 
[puppet] - 10https://gerrit.wikimedia.org/r/974642 (https://phabricator.wikimedia.org/T344036) [14:24:54] (03CR) 10Marostegui: mariadb: prepare copy of db1142 to db1242 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/974642 (https://phabricator.wikimedia.org/T344036) (owner: 10Arnaudb) [14:25:49] PROBLEM - Check systemd state on puppetserver1002 is CRITICAL: CRITICAL - degraded: The following units failed: sync-puppet-volatile.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:26:24] (03PS3) 10Arnaudb: mariadb: prepare copy of db1142 to db1242 [puppet] - 10https://gerrit.wikimedia.org/r/974642 (https://phabricator.wikimedia.org/T344036) [14:26:34] (03PS1) 10Filippo Giunchedi: icinga: add alert audit via puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/975284 (https://phabricator.wikimedia.org/T320931) [14:27:23] (03CR) 10CI reject: [V: 04-1] icinga: add alert audit via puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/975284 (https://phabricator.wikimedia.org/T320931) (owner: 10Filippo Giunchedi) [14:27:53] (03PS2) 10Filippo Giunchedi: icinga: add alert audit via puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/975284 (https://phabricator.wikimedia.org/T320931) [14:28:47] (03CR) 10CI reject: [V: 04-1] icinga: add alert audit via puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/975284 (https://phabricator.wikimedia.org/T320931) (owner: 10Filippo Giunchedi) [14:30:50] (03CR) 10Marostegui: [C: 03+1] mariadb: prepare copy of db1142 to db1242 [puppet] - 10https://gerrit.wikimedia.org/r/974642 (https://phabricator.wikimedia.org/T344036) (owner: 10Arnaudb) [14:31:17] (03CR) 10Jbond: [C: 03+1] "lgtm" [software/spicerack] - 10https://gerrit.wikimedia.org/r/975273 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [14:31:48] (03PS3) 10Filippo Giunchedi: icinga: add alert audit via puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/975284 (https://phabricator.wikimedia.org/T320931) [14:32:27] (03CR) 
10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/973321 (https://phabricator.wikimedia.org/T343232) (owner: 10Btullis) [14:33:21] (03CR) 10Arnaudb: [C: 03+2] mariadb: prepare copy of db1142 to db1242 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/974642 (https://phabricator.wikimedia.org/T344036) (owner: 10Arnaudb) [14:35:04] PROBLEM - Check systemd state on puppetserver2001 is CRITICAL: CRITICAL - degraded: The following units failed: sync-puppet-volatile.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:36:25] (03CR) 10Btullis: Send metrics from Airflow analytics test (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/968285 (https://phabricator.wikimedia.org/T349532) (owner: 10Aqu) [14:38:21] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:38:58] (03PS1) 10Elukey: Clean up ores configs not used anymore [puppet] - 10https://gerrit.wikimedia.org/r/975285 (https://phabricator.wikimedia.org/T347278) [14:39:32] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1142.eqiad.wmnet with reason: provisionning db1242.eqiad.wmnet - T344036 [14:39:37] T344036: Productionize db12[26-49] - https://phabricator.wikimedia.org/T344036 [14:39:46] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1142.eqiad.wmnet with reason: provisionning db1242.eqiad.wmnet - T344036 [14:39:50] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1242.eqiad.wmnet with reason: provisionning db1242.eqiad.wmnet - T344036 [14:40:01] (03CR) 10Elukey: "Added Moritz and Filippo for the specific bits (bullseye/buster tracking and graphite)" [puppet] - 10https://gerrit.wikimedia.org/r/975285 
(https://phabricator.wikimedia.org/T347278) (owner: 10Elukey) [14:40:04] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1242.eqiad.wmnet with reason: provisionning db1242.eqiad.wmnet - T344036 [14:41:08] (03CR) 10Muehlenhoff: Clean up ores configs not used anymore (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/975285 (https://phabricator.wikimedia.org/T347278) (owner: 10Elukey) [14:41:30] RECOVERY - Check systemd state on an-coord1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:41:48] (03PS2) 10JMeybohm: Move mw appservers to kubernetes workers [puppet] - 10https://gerrit.wikimedia.org/r/975228 (https://phabricator.wikimedia.org/T351074) [14:42:14] RECOVERY - Check systemd state on an-coord1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:42:22] (03CR) 10Klausman: "LGTM for everything except what Moritz noted." 
[puppet] - 10https://gerrit.wikimedia.org/r/975285 (https://phabricator.wikimedia.org/T347278) (owner: 10Elukey) [14:42:35] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Cloning db1142 in db1242 for T344036', diff saved to https://phabricator.wikimedia.org/P53547 and previous config saved to /var/cache/conftool/dbconfig/20231117-144234-arnaudb.json [14:42:36] PROBLEM - Check systemd state on an-worker1105 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:44:16] (03PS2) 10Elukey: Clean up ores configs not used anymore [puppet] - 10https://gerrit.wikimedia.org/r/975285 (https://phabricator.wikimedia.org/T347278) [14:44:18] (03CR) 10Elukey: Clean up ores configs not used anymore (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/975285 (https://phabricator.wikimedia.org/T347278) (owner: 10Elukey) [14:45:02] !log arnaudb@cumin1001 START - Cookbook sre.mysql.clone of db1142.eqiad.wmnet onto db1242.eqiad.wmnet [14:45:20] (03CR) 10Klausman: team-ml: add alert for memory spike in inf services (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/963724 (https://phabricator.wikimedia.org/T346151) (owner: 10Ilias Sarantopoulos) [14:45:33] (03CR) 10Clément Goubert: [C: 03+1] Move mw appservers to kubernetes workers [puppet] - 10https://gerrit.wikimedia.org/r/975228 (https://phabricator.wikimedia.org/T351074) (owner: 10JMeybohm) [14:45:39] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/975285 (https://phabricator.wikimedia.org/T347278) (owner: 10Elukey) [14:47:06] (03CR) 10Filippo Giunchedi: [C: 03+1] Clean up ores configs not used anymore [puppet] - 10https://gerrit.wikimedia.org/r/975285 (https://phabricator.wikimedia.org/T347278) (owner: 10Elukey) [14:48:09] 10SRE, 10Infrastructure-Foundations, 10netops: Migrate IP gateway for public1-a-codfw to spine switches - https://phabricator.wikimedia.org/T351532 (10cmooney) 
p:05Triage→03Medium [14:48:21] (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:48:31] (03Abandoned) 10Elukey: hiera/modules: remove references to ORES roles [puppet] - 10https://gerrit.wikimedia.org/r/963683 (owner: 10Klausman) [14:48:36] (03CR) 10Cathal Mooney: Add BGP to the contributing protocols for aggregate routes on CRs (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/975070 (https://phabricator.wikimedia.org/T351456) (owner: 10Cathal Mooney) [14:48:53] (03CR) 10Elukey: [C: 03+2] Clean up ores configs not used anymore [puppet] - 10https://gerrit.wikimedia.org/r/975285 (https://phabricator.wikimedia.org/T347278) (owner: 10Elukey) [14:50:20] RECOVERY - Check systemd state on kubernetes1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:51:54] (03PS1) 10Bking: staging-eqiad: raise rdf-streaming-updater quota [deployment-charts] - 10https://gerrit.wikimedia.org/r/975289 (https://phabricator.wikimedia.org/T349095) [14:52:10] (03CR) 10Filippo Giunchedi: "See https://phabricator.wikimedia.org/T320931#9340698 for a sample output" [puppet] - 10https://gerrit.wikimedia.org/r/975284 (https://phabricator.wikimedia.org/T320931) (owner: 10Filippo Giunchedi) [14:53:22] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:55:40] RECOVERY - Check systemd state on an-worker1131 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:56:58] RECOVERY - Router interfaces on cr1-esams is 
OK: OK: host 185.15.59.128, interfaces up: 75, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:57:56] PROBLEM - Check systemd state on ml-cache2001 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:58:12] (03CR) 10JHathaway: puppetserver::g10k: Ensure the control repo exists before we run g10k (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/975272 (owner: 10Jbond) [14:58:29] (03CR) 10DCausse: rdf-streaming-updater: update values for application mode (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/967229 (https://phabricator.wikimedia.org/T349095) (owner: 10Bking) [15:00:34] RECOVERY - Check systemd state on an-worker1100 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:05:43] !log cr1-esams> request chassis fpc slot 1 online - T351304 [15:05:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:02] T351304: FPC1 Failure on cr1-esams - https://phabricator.wikimedia.org/T351304 [15:08:21] (HelmReleaseBadStatus) firing: Helm release kube-system/kube-state-metrics on k8s-staging@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=kube-system - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [15:10:42] RECOVERY - BGP status on asw1-by27-esams.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:10:54] RECOVERY - BGP status on asw1-bw27-esams.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:14:02] (03PS3) 10Jbond: 
puppetserver::g10k: Ensure the control repo exists before we run g10k [puppet] - 10https://gerrit.wikimedia.org/r/975272 [15:14:16] (03CR) 10Jbond: "updated thanks" [puppet] - 10https://gerrit.wikimedia.org/r/975272 (owner: 10Jbond) [15:18:08] 10SRE, 10Infrastructure-Foundations, 10netops: Migrate IP gateway for public1-a-codfw to spine switches - https://phabricator.wikimedia.org/T351532 (10cmooney) [15:18:21] (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:20:56] 10SRE, 10Infrastructure-Foundations, 10netops: FPC1 Failure on cr1-esams - https://phabricator.wikimedia.org/T351304 (10ayounsi) 05Open→03Resolved Replaced. [15:24:05] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349840 (10MatthewVernon) a:05MatthewVernon→03RobH Hi @RobH. I think: hostnames: ms-be1076-1082 racking: no more than 1 server per rack, please (but they can go in racks that alread... [15:25:14] (03CR) 10Herron: [C: 03+1] "Nice! Looks great." 
[puppet] - 10https://gerrit.wikimedia.org/r/975284 (https://phabricator.wikimedia.org/T320931) (owner: 10Filippo Giunchedi) [15:25:46] (03PS4) 10Vgutierrez: interface: Allow creating IPIP interfaces w/o an endpoint [puppet] - 10https://gerrit.wikimedia.org/r/975253 (https://phabricator.wikimedia.org/T351069) [15:25:48] (03PS8) 10Vgutierrez: pybal,wmflib::service: Add ipip_encapsulation flag on lvs [puppet] - 10https://gerrit.wikimedia.org/r/974623 (https://phabricator.wikimedia.org/T351069) [15:27:38] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/554/console" [puppet] - 10https://gerrit.wikimedia.org/r/975253 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [15:30:34] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10DC-Ops, 10cloud-services-team (FY2023/2024-Q1-Q2): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (10Andrew) [15:31:14] (03PS1) 10Majavah: sslcert: use concat to generate trusted_ca [puppet] - 10https://gerrit.wikimedia.org/r/975299 [15:32:26] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148 (T348183)', diff saved to https://phabricator.wikimedia.org/P53549 and previous config saved to /var/cache/conftool/dbconfig/20231117-153225-arnaudb.json [15:32:31] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [15:32:48] (03PS9) 10Vgutierrez: pybal,wmflib::service: Add ipip_encapsulation flag on lvs [puppet] - 10https://gerrit.wikimedia.org/r/974623 (https://phabricator.wikimedia.org/T351069) [15:32:53] (03PS2) 10Majavah: sslcert: use concat to generate trusted_ca [puppet] - 10https://gerrit.wikimedia.org/r/975299 [15:33:49] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q2:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349839 
(10MatthewVernon) a:05MatthewVernon→03RobH Hi. I think: hostnames: ms-be20[74-80] racking: not more than 1 per rack, please, though they can share with existing nodes (e.g.... [15:33:59] (PuppetFailure) firing: Puppet has failed on ml-serve1008:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [15:36:37] (03PS54) 10Bking: rdf-streaming-updater: update values for application mode [deployment-charts] - 10https://gerrit.wikimedia.org/r/967229 (https://phabricator.wikimedia.org/T349095) [15:37:59] (03CR) 10Jbond: [C: 03+1] "lgtm thanks" [puppet] - 10https://gerrit.wikimedia.org/r/975299 (owner: 10Majavah) [15:38:08] (03PS1) 10AikoChou: ml-services: update revertrisk-la image and model binary [deployment-charts] - 10https://gerrit.wikimedia.org/r/975304 (https://phabricator.wikimedia.org/T347550) [15:38:47] (03CR) 10Majavah: [C: 03+2] sslcert: use concat to generate trusted_ca [puppet] - 10https://gerrit.wikimedia.org/r/975299 (owner: 10Majavah) [15:38:51] !log bking@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [15:38:57] !log bking@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [15:40:44] RECOVERY - Check systemd state on an-worker1105 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:46:57] (03CR) 10BBlack: [C: 03+1] "Seems right to me, nice work!" 
[puppet] - 10https://gerrit.wikimedia.org/r/975253 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [15:47:32] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148', diff saved to https://phabricator.wikimedia.org/P53550 and previous config saved to /var/cache/conftool/dbconfig/20231117-154731-arnaudb.json [15:50:24] (03CR) 10Jbond: puppet: update gat_ca_server to also support srv discovery (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/974995 (https://phabricator.wikimedia.org/T341496) (owner: 10Jbond) [15:50:30] (03PS9) 10Jbond: puppet: update gat_ca_server to also support srv discovery [software/spicerack] - 10https://gerrit.wikimedia.org/r/974995 (https://phabricator.wikimedia.org/T341496) [15:51:04] (03CR) 10Bking: [C: 03+2] staging-eqiad: raise rdf-streaming-updater quota [deployment-charts] - 10https://gerrit.wikimedia.org/r/975289 (https://phabricator.wikimedia.org/T349095) (owner: 10Bking) [15:52:03] (03CR) 10Bking: [C: 03+1] "self-merging, as change this was already approved by ServiceOps in I318e7557c72b71587dafc0d039e0c64493f865d1" [deployment-charts] - 10https://gerrit.wikimedia.org/r/975289 (https://phabricator.wikimedia.org/T349095) (owner: 10Bking) [15:52:34] (03CR) 10Thiemo Kreuz (WMDE): [C: 04-1] Update the list of NavigationPopups gadget names (0313 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/975021 (https://phabricator.wikimedia.org/T351314) (owner: 10WMDE-Fisch) [15:52:59] 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Grant access to nda LDAP group to xqt - https://phabricator.wikimedia.org/T348520 (10Dzahn) a:03Xqt [15:53:00] (03CR) 10Ssingh: [V: 03+1] "Revising a bit after discussion with bblack and how the etcd paths should look like." 
[puppet] - 10https://gerrit.wikimedia.org/r/975009 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [15:54:39] (03CR) 10Bking: [C: 03+2] staging-eqiad: raise rdf-streaming-updater quota [deployment-charts] - 10https://gerrit.wikimedia.org/r/975289 (https://phabricator.wikimedia.org/T349095) (owner: 10Bking) [15:55:07] (03PS2) 10Ssingh: conftool: introduce schema and host file for dnsboxes [puppet] - 10https://gerrit.wikimedia.org/r/975009 (https://phabricator.wikimedia.org/T347054) [15:55:08] RECOVERY - Check systemd state on ml-cache2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:56:03] !log bking@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [15:56:11] !log bking@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [15:56:18] !log bking@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [15:57:16] (MediaWikiLatencyExceeded) firing: Average latency high: codfw parsoid GET/200: 2.888546458538797s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [15:57:46] !log bking@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [15:57:55] !log bking@deploy2002 helmfile [eqiad] START helmfile.d/admin 'apply'. [15:58:02] !log bking@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [15:58:03] (03CR) 10CI reject: [V: 04-1] puppet: update gat_ca_server to also support srv discovery [software/spicerack] - 10https://gerrit.wikimedia.org/r/974995 (https://phabricator.wikimedia.org/T341496) (owner: 10Jbond) [15:58:08] (03CR) 10Btullis: [C: 03+1] "Looks good. Many thanks." 
[puppet] - 10https://gerrit.wikimedia.org/r/975093 (owner: 10Dzahn) [15:58:09] !log bking@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'. [15:58:15] !log bking@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [15:59:59] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "response time was incredible:) thanks. also noop in compiler: https://puppet-compiler.wmflabs.org/output/975093/557/" [puppet] - 10https://gerrit.wikimedia.org/r/975093 (owner: 10Dzahn) [16:01:01] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "I see puppet is disabled on matomo1002 - unrelated work?" [puppet] - 10https://gerrit.wikimedia.org/r/975093 (owner: 10Dzahn) [16:01:33] (03CR) 10BBlack: [C: 03+1] "LGTM from a logical perspective" [puppet] - 10https://gerrit.wikimedia.org/r/975009 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [16:02:16] (MediaWikiLatencyExceeded) resolved: Average latency high: codfw parsoid GET/200: 3.0904484311989004s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [16:02:39] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148', diff saved to https://phabricator.wikimedia.org/P53551 and previous config saved to /var/cache/conftool/dbconfig/20231117-160238-arnaudb.json [16:02:46] (03CR) 10Jbond: "lgtm suggestion inline" [puppet] - 10https://gerrit.wikimedia.org/r/975080 (https://phabricator.wikimedia.org/T351465) (owner: 10JHathaway) [16:03:08] (03PS1) 10Btullis: Fix an issue with the matomo TagManager configuration [puppet] - 10https://gerrit.wikimedia.org/r/975311 (https://phabricator.wikimedia.org/T349910) [16:03:21] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - 
https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [16:03:52] (03PS2) 10Btullis: Fix an issue with the matomo TagManager configuration [puppet] - 10https://gerrit.wikimedia.org/r/975311 (https://phabricator.wikimedia.org/T349910) [16:04:07] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/975311 (https://phabricator.wikimedia.org/T349910) (owner: 10Btullis) [16:05:40] (03CR) 10Btullis: [C: 03+1] piwik: avoid hardcoded PHP version string (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/975093 (owner: 10Dzahn) [16:06:56] (03PS3) 10Btullis: Fix an issue with the matomo TagManager configuration [puppet] - 10https://gerrit.wikimedia.org/r/975311 (https://phabricator.wikimedia.org/T349910) [16:08:17] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/558/con" [puppet] - 10https://gerrit.wikimedia.org/r/975311 (https://phabricator.wikimedia.org/T349910) (owner: 10Btullis) [16:08:50] (03CR) 10Dreamy Jazz: [C: 03+1] "This config change makes sense to me." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/975270 (https://phabricator.wikimedia.org/T351299) (owner: 10Kosta Harlan) [16:09:12] (03CR) 10Majavah: [C: 03+1] puppetserver::g10k: Ensure the control repo exists before we run g10k [puppet] - 10https://gerrit.wikimedia.org/r/975272 (owner: 10Jbond) [16:10:05] (03CR) 10Dzahn: [V: 03+1 C: 03+2] piwik: avoid hardcoded PHP version string (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/975093 (owner: 10Dzahn) [16:11:12] (03CR) 10Btullis: [V: 03+1 C: 03+2] Fix an issue with the matomo TagManager configuration [puppet] - 10https://gerrit.wikimedia.org/r/975311 (https://phabricator.wikimedia.org/T349910) (owner: 10Btullis) [16:11:14] (03PS1) 10Ebernhardson: cirrus updater: Expand consumer to include itwiki and frwiki [deployment-charts] - 10https://gerrit.wikimedia.org/r/975320 [16:11:16] (03PS1) 10Ebernhardson: cirrus updater: Remove consumer start time override [deployment-charts] - 10https://gerrit.wikimedia.org/r/975321 [16:12:47] (03CR) 10Ebernhardson: [C: 03+2] cirrus updater: Expand consumer to include itwiki and frwiki [deployment-charts] - 10https://gerrit.wikimedia.org/r/975320 (owner: 10Ebernhardson) [16:13:40] (03Merged) 10jenkins-bot: cirrus updater: Expand consumer to include itwiki and frwiki [deployment-charts] - 10https://gerrit.wikimedia.org/r/975320 (owner: 10Ebernhardson) [16:14:10] (03PS1) 10Vgutierrez: interface: Add a clsact helper [puppet] - 10https://gerrit.wikimedia.org/r/975324 (https://phabricator.wikimedia.org/T351069) [16:16:23] (03PS6) 10Bking: query_service: add monitoring for ldf endpoint [puppet] - 10https://gerrit.wikimedia.org/r/974281 (https://phabricator.wikimedia.org/T347355) [16:16:38] PROBLEM - Check systemd state on an-worker1120 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:17:30] (03CR) 10Thiemo Kreuz (WMDE): [C: 04-1] Update the list of 
NavigationPopups gadget names (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/975021 (https://phabricator.wikimedia.org/T351314) (owner: 10WMDE-Fisch) [16:17:45] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148 (T348183)', diff saved to https://phabricator.wikimedia.org/P53552 and previous config saved to /var/cache/conftool/dbconfig/20231117-161744-arnaudb.json [16:17:47] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1149.eqiad.wmnet with reason: Maintenance [16:17:49] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [16:18:00] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1149.eqiad.wmnet with reason: Maintenance [16:18:02] (03CR) 10Ilias Sarantopoulos: [C: 03+1] ml-services: update revertrisk-la image and model binary [deployment-charts] - 10https://gerrit.wikimedia.org/r/975304 (https://phabricator.wikimedia.org/T347550) (owner: 10AikoChou) [16:18:07] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1149 (T348183)', diff saved to https://phabricator.wikimedia.org/P53553 and previous config saved to /var/cache/conftool/dbconfig/20231117-161806-arnaudb.json [16:18:59] (03PS2) 10Vgutierrez: interface: Add a clsact helper [puppet] - 10https://gerrit.wikimedia.org/r/975324 (https://phabricator.wikimedia.org/T351069) [16:20:37] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/559/console" [puppet] - 10https://gerrit.wikimedia.org/r/975324 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [16:21:02] (03CR) 10Kosta Harlan: [betalabs] ReportIncident: Relax rate limiting for reportincident action (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/975270 
(https://phabricator.wikimedia.org/T351299) (owner: 10Kosta Harlan) [16:24:11] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/output/974660/560/" [puppet] - 10https://gerrit.wikimedia.org/r/974660 (https://phabricator.wikimedia.org/T351333) (owner: 10Dzahn) [16:26:14] !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [16:26:29] !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [16:27:18] (03CR) 10Jbond: [C: 03+2] puppetserver::g10k: Ensure the control repo exists before we run g10k [puppet] - 10https://gerrit.wikimedia.org/r/975272 (owner: 10Jbond) [16:29:16] (MediaWikiLatencyExceeded) firing: Average latency high: codfw parsoid GET/200: 4.136149810350958s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [16:29:43] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/974281 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking) [16:33:45] (03PS7) 10Bking: query_service: add monitoring for ldf endpoint [puppet] - 10https://gerrit.wikimedia.org/r/974281 (https://phabricator.wikimedia.org/T347355) [16:36:40] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/974281 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking) [16:38:31] (03PS10) 10Jbond: puppet: update gat_ca_server to also support srv discovery [software/spicerack] - 10https://gerrit.wikimedia.org/r/974995 (https://phabricator.wikimedia.org/T341496) [16:39:54] RECOVERY - cassandra-b service on aqs1012 is OK: OK - cassandra-b is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:40:23] 
(03Abandoned) 10Mvolz: rest-gateway: fix citoid regex [deployment-charts] - 10https://gerrit.wikimedia.org/r/954248 (https://phabricator.wikimedia.org/T329049) (owner: 10Mvolz) [16:40:36] RECOVERY - cassandra-b SSL 10.64.32.145:7000 on aqs1012 is OK: SSL OK - Certificate aqs1012-b valid until 2024-05-19 08:40:12 +0000 (expires in 183 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [16:41:00] (03PS2) 10Ebernhardson: cirrus updater: Remove consumer start time override [deployment-charts] - 10https://gerrit.wikimedia.org/r/975321 [16:41:02] (03PS1) 10Ebernhardson: cirrus updater: Use alternate form of iso8601 timestamp [deployment-charts] - 10https://gerrit.wikimedia.org/r/975333 [16:41:10] (03CR) 10CI reject: [V: 04-1] cirrus updater: Remove consumer start time override [deployment-charts] - 10https://gerrit.wikimedia.org/r/975321 (owner: 10Ebernhardson) [16:41:12] (03CR) 10CI reject: [V: 04-1] cirrus updater: Use alternate form of iso8601 timestamp [deployment-charts] - 10https://gerrit.wikimedia.org/r/975333 (owner: 10Ebernhardson) [16:43:14] (03PS2) 10Ebernhardson: cirrus updater: Use alternate form of iso8601 timestamp [deployment-charts] - 10https://gerrit.wikimedia.org/r/975333 [16:43:23] (03CR) 10Ebernhardson: [C: 03+2] cirrus updater: Use alternate form of iso8601 timestamp [deployment-charts] - 10https://gerrit.wikimedia.org/r/975333 (owner: 10Ebernhardson) [16:44:15] (03Merged) 10jenkins-bot: cirrus updater: Use alternate form of iso8601 timestamp [deployment-charts] - 10https://gerrit.wikimedia.org/r/975333 (owner: 10Ebernhardson) [16:46:44] !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [16:46:59] !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [16:49:16] (MediaWikiLatencyExceeded) resolved: Average latency high: codfw parsoid GET/200: 2.2174800845539s - 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [17:11:49] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db1142.eqiad.wmnet onto db1242.eqiad.wmnet [17:12:44] (03CR) 10AikoChou: [C: 03+2] ml-services: update revertrisk-la image and model binary [deployment-charts] - 10https://gerrit.wikimedia.org/r/975304 (https://phabricator.wikimedia.org/T347550) (owner: 10AikoChou) [17:13:10] RECOVERY - Check systemd state on an-worker1120 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:13:53] (03Merged) 10jenkins-bot: ml-services: update revertrisk-la image and model binary [deployment-charts] - 10https://gerrit.wikimedia.org/r/975304 (https://phabricator.wikimedia.org/T347550) (owner: 10AikoChou) [17:29:20] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q2:rack/setup/install ganeti for eqiad - https://phabricator.wikimedia.org/T349925 (10VRiley-WMF) [17:32:04] 10SRE, 10SRE-Access-Requests: Requesting access to wmf and analytics-privatedata-users for EHughes (superset access with no server access) - https://phabricator.wikimedia.org/T351387 (10spatton) Hi @MatthewVernon, this is approved from my side, thanks much! [17:34:09] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q2:rack/setup/install ganeti for eqiad - https://phabricator.wikimedia.org/T349925 (10VRiley-WMF) ganeti1035.eqiad.wmnet Service Tag: 6DN8PZ3 Asset: WMF11370 Express Service Code: 13885792383 Rack: A2 Position: U33 Port: 41 Cableid: 230304500230... 
[17:40:45] !log vriley@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host ganeti1035 [17:42:11] !log vriley@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti1035 [17:43:24] !log vriley@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host ganeti1036 [17:45:18] !log vriley@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host ganeti1037 [17:45:58] !log vriley@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti1036 [17:46:27] !log vriley@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti1037 [17:46:38] !log vriley@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host ganeti1038 [17:47:31] !log vriley@cumin1001 START - Cookbook sre.hosts.provision for host ganeti1035.mgmt.eqiad.wmnet with reboot policy FORCED [17:47:45] !log vriley@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti1038 [17:48:36] !log vriley@cumin1001 START - Cookbook sre.hosts.provision for host ganeti1036.mgmt.eqiad.wmnet with reboot policy FORCED [17:49:15] !log vriley@cumin1001 START - Cookbook sre.hosts.provision for host ganeti1037.mgmt.eqiad.wmnet with reboot policy FORCED [17:50:27] !log vriley@cumin1001 START - Cookbook sre.hosts.provision for host ganeti1038.mgmt.eqiad.wmnet with reboot policy FORCED [17:56:19] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/974281 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking) [17:58:43] !log vriley@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti1035.mgmt.eqiad.wmnet with reboot policy FORCED [17:59:05] (03PS8) 10Bking: query_service: add monitoring for ldf endpoint [puppet] - 10https://gerrit.wikimedia.org/r/974281 (https://phabricator.wikimedia.org/T347355) [17:59:48] !log 
vriley@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti1036.mgmt.eqiad.wmnet with reboot policy FORCED [17:59:54] !log vriley@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti1037.mgmt.eqiad.wmnet with reboot policy FORCED [18:01:10] !log vriley@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti1038.mgmt.eqiad.wmnet with reboot policy FORCED [18:02:22] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/974281 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking) [18:04:28] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q2:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349839 (10RobH) [18:04:59] !log vriley@cumin1001 START - Cookbook sre.hosts.provision for host ganeti1035.mgmt.eqiad.wmnet with reboot policy FORCED [18:05:26] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q2:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349839 (10RobH) a:05RobH→03None Updated task description with updated racking details and removing myself as assignee. Once these arrive on-site, one of our #ops-codfw engineers w... [18:05:43] !log vriley@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti1035.mgmt.eqiad.wmnet with reboot policy FORCED [18:13:16] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [18:16:56] VRiley: You forget to commit changes while working on ganeti? [18:17:54] Oh, I test them before I commit them. 
But I'll double check
[18:18:23] i often fire the dns cookbook
[18:18:28] swap tabs, and its just sitting there waiting for me to confirm
[18:18:38] when i have that dns error that is why 99% of the time ; D
[18:19:22] A variety of ganeti and wmf a/aaaa records are ready for committing
[18:19:42] (and ptr)
[18:33:45] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations: Q2:rack/setup/install ganeti103[5-8] - https://phabricator.wikimedia.org/T349925 (RobH)
[18:34:01] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations: Q2:rack/setup/install ganeti103[5-8] - https://phabricator.wikimedia.org/T349925 (RobH) a:RobH→VRiley-WMF
[18:53:32] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations: Q2:rack/setup/install ganeti103[5-8] - https://phabricator.wikimedia.org/T349925 (MoritzMuehlenhoff) Please also enable virtualisation for these in the BIOS, they will serve as virt servers.
[19:08:21] (HelmReleaseBadStatus) firing: Helm release kube-system/kube-state-metrics on k8s-staging@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=kube-system - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[19:08:36] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[19:10:46] (CR) Dzahn: "nothing seems to use it per openstack-browser, but simplelamp2 is used and was basically the same change and noop" [puppet] - https://gerrit.wikimedia.org/r/975094 (owner: Dzahn)
[19:11:29] (PS3) Krinkle: Set new $wgMicroStashType setting to "mcrouter-primary-dc" [mediawiki-config] - https://gerrit.wikimedia.org/r/974506 (https://phabricator.wikimedia.org/T336004) (owner: D3r1ck01)
[19:11:51] (CR) Krinkle: [C: +1] "Approved for deployment at your earliest convenience." [mediawiki-config] - https://gerrit.wikimedia.org/r/974506 (https://phabricator.wikimedia.org/T336004) (owner: D3r1ck01)
[19:16:46] SRE, ops-eqiad, Infrastructure-Foundations, netops: Switch failure: asw2-a8-eqiad Aug 13th 2021 - https://phabricator.wikimedia.org/T288834 (cmooney) I believe that the bug that caused this has been fixed in 21.4R3-S5 for EX4300 devices.
[19:18:21] (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:33:59] (PuppetFailure) firing: Puppet has failed on ml-serve1008:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[19:43:34] (PS6) Dzahn: planet: Update for rawdog v3 on bookworm [puppet] - https://gerrit.wikimedia.org/r/964176 (https://phabricator.wikimedia.org/T348392) (owner: Legoktm)
[19:51:14] !log brion regenerating .m3u8 streaming manifests for all video files on mwmaint2002 (cleanup for T350996)
[19:51:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:51:19] T350996: HLS meta playlist .m3u8 includes not-yet-made transcodes - https://phabricator.wikimedia.org/T350996
[19:52:29] (PS7) Dzahn: planet: Update for rawdog v3 on bookworm [puppet] - https://gerrit.wikimedia.org/r/964176 (https://phabricator.wikimedia.org/T348392) (owner: Legoktm)
[19:53:02] (PS1) Jdlrobson: Revert "mw.notify: Limit width of overlay to max-width-page-container" [skins/Vector] (wmf/1.42.0-wmf.5) - https://gerrit.wikimedia.org/r/975366 (https://phabricator.wikimedia.org/T349622)
[19:57:26] (CR) Dzahn: [V: +1 C: +1] "I got it to be a noop now. could merge it like this without a change: https://puppet-compiler.wmflabs.org/output/964176/564/planet1002.eqi" [puppet] - https://gerrit.wikimedia.org/r/964176 (https://phabricator.wikimedia.org/T348392) (owner: Legoktm)
[20:03:21] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[20:10:46] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[20:13:26] (PS1) Dduvall: gitlab_runner: Allow rsyncd access to zuul.devtools.wmcloud.org [puppet] - https://gerrit.wikimedia.org/r/975360 (https://phabricator.wikimedia.org/T351329)
[20:17:59] (CR) Dduvall: "I cherry picked this to puppetmaster-1001.devtools.eqiad1.wikimedia.cloud, applied, and was able to rsync from zuul.devtools.wmcloud.org w" [puppet] - https://gerrit.wikimedia.org/r/975360 (https://phabricator.wikimedia.org/T351329) (owner: Dduvall)
[20:22:11] SRE, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349840 (RobH) a:RobH→None
[20:30:55] SRE, ops-eqiad, DC-Ops, Data-Engineering, Data-Platform-SRE: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (RobH)
[20:38:30] SRE: molly-guard does not apply to `systemctl reboot` - https://phabricator.wikimedia.org/T351570 (taavi)
[20:41:09] (MediaWikiLatencyExceeded) firing: Average latency high: codfw mw-wikifunctions (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[20:46:09] (MediaWikiLatencyExceeded) resolved: Average latency high: codfw mw-wikifunctions (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[20:59:04] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 128, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[20:59:06] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[21:10:17] (CR) Dzahn: [V: +1 C: +2] planet: Update for rawdog v3 on bookworm [puppet] - https://gerrit.wikimedia.org/r/964176 (https://phabricator.wikimedia.org/T348392) (owner: Legoktm)
[21:13:27] (CR) Dzahn: [V: +1 C: +2] "decided to just merge to move forward. we can always build on top of it now. complete noop on existing servers confirmed." [puppet] - https://gerrit.wikimedia.org/r/964176 (https://phabricator.wikimedia.org/T348392) (owner: Legoktm)
[21:14:29] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149 (T348183)', diff saved to https://phabricator.wikimedia.org/P53556 and previous config saved to /var/cache/conftool/dbconfig/20231117-211428-arnaudb.json
[21:14:35] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[21:14:50] (CR) Greg Grossmeier: [C: +1] "Thanks!" [puppet] - https://gerrit.wikimedia.org/r/975242 (https://phabricator.wikimedia.org/T349402) (owner: Muehlenhoff)
[21:29:35] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149', diff saved to https://phabricator.wikimedia.org/P53557 and previous config saved to /var/cache/conftool/dbconfig/20231117-212935-arnaudb.json
[21:30:00] (CR) Dwisehaupt: [C: +1] "LGTM." [puppet] - https://gerrit.wikimedia.org/r/975242 (https://phabricator.wikimedia.org/T349402) (owner: Muehlenhoff)
[21:44:42] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149', diff saved to https://phabricator.wikimedia.org/P53558 and previous config saved to /var/cache/conftool/dbconfig/20231117-214441-arnaudb.json
[21:53:42] (CR) Ahmon Dancy: [C: +1] gitlab_runner: Allow rsyncd access to zuul.devtools.wmcloud.org [puppet] - https://gerrit.wikimedia.org/r/975360 (https://phabricator.wikimedia.org/T351329) (owner: Dduvall)
[21:59:48] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149 (T348183)', diff saved to https://phabricator.wikimedia.org/P53559 and previous config saved to /var/cache/conftool/dbconfig/20231117-215947-arnaudb.json
[21:59:50] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[21:59:53] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[22:00:04] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[22:13:42] (CR) BCornwall: readme: Update repo location of varnishkafka (1 comment) [software/varnish/varnishkafka/testing] - https://gerrit.wikimedia.org/r/974289 (owner: BCornwall)
[22:13:55] (PS2) BCornwall: readme: Update repo location of varnishkafka [software/varnish/varnishkafka/testing] - https://gerrit.wikimedia.org/r/974289 (https://phabricator.wikimedia.org/T347623)
[22:14:39] (CR) BCornwall: [C: +2] readme: Update repo location of varnishkafka [software/varnish/varnishkafka/testing] - https://gerrit.wikimedia.org/r/974289 (https://phabricator.wikimedia.org/T347623) (owner: BCornwall)
[22:14:41] (CR) BCornwall: [V: +2 C: +2] readme: Update repo location of varnishkafka [software/varnish/varnishkafka/testing] - https://gerrit.wikimedia.org/r/974289 (https://phabricator.wikimedia.org/T347623) (owner: BCornwall)
[22:30:16] (CR) RhinosF1: [C: +1] simplelap: avoid hardcoded PHP version string [puppet] - https://gerrit.wikimedia.org/r/975094 (owner: Dzahn)
[22:31:00] (PS2) Krinkle: [BETA HACK] confd: Fix confd hostname [puppet] - https://gerrit.wikimedia.org/r/941478
[22:31:23] (Abandoned) Krinkle: [BETA HACK] Attempt to secure Puppet DB better [puppet] - https://gerrit.wikimedia.org/r/941476 (owner: Krinkle)
[23:08:21] (HelmReleaseBadStatus) firing: Helm release kube-system/kube-state-metrics on k8s-staging@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=kube-system - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[23:18:21] (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:34:14] (PuppetFailure) firing: Puppet has failed on ml-serve1008:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[23:38:29] !log vriley@cumin1001 START - Cookbook sre.hosts.provision for host ganeti1035.mgmt.eqiad.wmnet with reboot policy FORCED
[23:39:13] !log vriley@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti1035.mgmt.eqiad.wmnet with reboot policy FORCED