[00:03:16] PROBLEM - dump of es4 in eqiad on backupmon1001 is CRITICAL: dump for es4 at eqiad (es1022) taken more than a week ago: Most recent backup 2023-08-22 00:00:03 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [00:05:32] (DatasourceError) firing: (2) - https://alerts.wikimedia.org/?q=alertname%3DDatasourceError [00:08:50] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [00:10:32] (DatasourceError) resolved: (2) - https://alerts.wikimedia.org/?q=alertname%3DDatasourceError [00:16:40] PROBLEM - dump of es4 in codfw on backupmon1001 is CRITICAL: dump for es4 at codfw (es2022) taken more than a week ago: Most recent backup 2023-08-22 00:00:06 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [00:20:36] PROBLEM - dump of es5 in codfw on backupmon1001 is CRITICAL: dump for es5 at codfw (es2025) taken more than a week ago: Most recent backup 2023-08-22 00:00:06 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [00:31:36] PROBLEM - dump of es5 in eqiad on backupmon1001 is CRITICAL: dump for es5 at eqiad (es1025) taken more than a week ago: Most recent backup 2023-08-22 00:00:03 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [00:39:20] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/953490 [00:39:26] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/953490 (owner: 10TrainBranchBot) [00:43:58] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['moss-be2003'] [00:50:50] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts 
['moss-be2003'] [00:52:18] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install moss-be200[34] - https://phabricator.wikimedia.org/T342674 (10Jhancock.wm) 05Resolved→03Open [00:54:50] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host moss-be2003.codfw.wmnet with OS bullseye [00:54:57] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install moss-be200[34] - https://phabricator.wikimedia.org/T342674 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host moss-be2003.codfw.wmnet with OS bullseye [00:55:00] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/953490 (owner: 10TrainBranchBot) [01:25:46] 10SRE, 10Infrastructure-Foundations, 10Puppet-Core: Puppet certificate missing subjectAltName - https://phabricator.wikimedia.org/T158757 (10nshahquinn-wmf) FYI, Urllib3 version 2, released in April 2023, [removed the fallback from subjectAltName to commonName](https://github.com/urllib3/urllib3/blob/main/CHA... 
[01:36:23] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['moss-be2003'] [01:37:46] (Not accepting/receiving prefixes from anycast BGP peer) firing: (2) Alert for device cr1-codfw.wikimedia.org - Not accepting/receiving prefixes from anycast BGP peer - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer [01:42:37] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['moss-be2003'] [01:43:14] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host moss-be2003.codfw.wmnet with OS bullseye [01:43:21] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install moss-be200[34] - https://phabricator.wikimedia.org/T342674 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host moss-be2003.codfw.wmnet with OS bullseye [01:43:23] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host moss-be2003.codfw.wmnet with OS bullseye [01:43:29] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install moss-be200[34] - https://phabricator.wikimedia.org/T342674 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host moss-be2003.codfw.wmnet with OS bullseye executed with... 
[01:43:58] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host moss-be2003.codfw.wmnet with OS bullseye [01:44:06] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install moss-be200[34] - https://phabricator.wikimedia.org/T342674 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host moss-be2003.codfw.wmnet with OS bullseye [01:44:07] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host moss-be2003.codfw.wmnet with OS bullseye [01:44:14] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install moss-be200[34] - https://phabricator.wikimedia.org/T342674 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host moss-be2003.codfw.wmnet with OS bullseye executed with... [01:50:36] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes2039.codfw.wmnet with OS bullseye [01:50:37] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes2038.codfw.wmnet with OS bullseye [01:50:39] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes2037.codfw.wmnet with OS bullseye [01:50:43] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kubernetes2039.codfw.wmnet with OS bullseye [01:50:47] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kubernetes2038.codfw.wmnet with OS bullseye [01:50:49] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) 
Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kubernetes2037.codfw.wmnet with OS bullseye [02:08:57] (JobUnavailable) firing: (2) Reduced availability for job nutcracker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:26:51] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2037.codfw.wmnet with reason: host reimage [02:31:51] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2037.codfw.wmnet with reason: host reimage [02:33:57] (JobUnavailable) firing: (2) Reduced availability for job nutcracker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:37:15] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10Jhancock.wm) [02:45:52] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [03:01:42] (SystemdUnitFailed) firing: elasticsearch-disable-readahead.service Failed on elastic2038:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:31:10] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:37:26] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit 
httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [03:50:41] (03PS1) 10Anzx: tlywiki: Add logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953750 (https://phabricator.wikimedia.org/T345316) [03:52:55] (03Abandoned) 10Anzx: tlywiki: Add logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953750 (https://phabricator.wikimedia.org/T345316) (owner: 10Anzx) [04:00:16] (03PS1) 10Anzx: tlywiki: add logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953751 (https://phabricator.wikimedia.org/T345316) [04:28:42] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:29:26] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin1001 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [04:47:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1201 (re)pooling @ 10%: Repooling after maintenance ', diff saved to https://phabricator.wikimedia.org/P52120 and previous config saved to /var/cache/conftool/dbconfig/20230831-044746-root.json [04:50:30] (03PS1) 10Marostegui: Revert "db1201: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/953651 [04:52:46] (03CR) 10Winston Sung: "This change is ready for review." 
(038 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953650 (https://phabricator.wikimedia.org/T172035) (owner: 10Winston Sung) [04:54:03] (03PS4) 10Winston Sung: SiteMatrix config: Remove deprecated language codes from the list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953650 (https://phabricator.wikimedia.org/T172035) [04:54:40] (03CR) 10CI reject: [V: 04-1] SiteMatrix config: Remove deprecated language codes from the list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953650 (https://phabricator.wikimedia.org/T172035) (owner: 10Winston Sung) [04:55:30] (03PS5) 10Winston Sung: SiteMatrix config: Remove deprecated language codes from the list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953650 (https://phabricator.wikimedia.org/T172035) [04:56:39] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 26 hosts with reason: Primary switchover s6 T345223 [04:56:45] T345223: Switchover s6 master (db1131 -> db1173) - https://phabricator.wikimedia.org/T345223 [04:57:09] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 26 hosts with reason: Primary switchover s6 T345223 [04:57:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set db1173 with weight 0 T345223', diff saved to https://phabricator.wikimedia.org/P52121 and previous config saved to /var/cache/conftool/dbconfig/20230831-045719-marostegui.json [04:59:32] (03PS4) 10KartikMistry: Enable MinT translation service for testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953216 (owner: 10Abijeet Patro) [05:01:24] (03CR) 10Marostegui: [C: 03+2] mariadb: Promote db1173 to s6 master [puppet] - 10https://gerrit.wikimedia.org/r/953487 (https://phabricator.wikimedia.org/T345223) (owner: 10Gerrit maintenance bot) [05:02:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1201 (re)pooling @ 25%: Repooling after maintenance ', diff saved to https://phabricator.wikimedia.org/P52122 and previous 
config saved to /var/cache/conftool/dbconfig/20230831-050250-root.json [05:16:33] (03PS2) 10KartikMistry: Update cxserver to 2023-08-29-191442-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/952568 (https://phabricator.wikimedia.org/T345170) [05:16:35] (03PS2) 10Marostegui: wmnet: Update s6-master alias [dns] - 10https://gerrit.wikimedia.org/r/953488 (https://phabricator.wikimedia.org/T345223) (owner: 10Gerrit maintenance bot) [05:17:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1201 (re)pooling @ 50%: Repooling after maintenance ', diff saved to https://phabricator.wikimedia.org/P52123 and previous config saved to /var/cache/conftool/dbconfig/20230831-051755-root.json [05:28:11] !log Starting s6 eqiad failover from db1131 to db1173 - T345223 [05:28:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:28:17] T345223: Switchover s6 master (db1131 -> db1173) - https://phabricator.wikimedia.org/T345223 [05:28:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set s6 eqiad as read-only for maintenance - T345223', diff saved to https://phabricator.wikimedia.org/P52124 and previous config saved to /var/cache/conftool/dbconfig/20230831-052825-marostegui.json [05:28:29] marostegui@cumin1001: Failed to log message to wiki. Somebody should check the error logs. 
[05:28:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote db1173 to s6 primary and set section read-write T345223', diff saved to https://phabricator.wikimedia.org/P52125 and previous config saved to /var/cache/conftool/dbconfig/20230831-052852-marostegui.json [05:30:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1131 T345223', diff saved to https://phabricator.wikimedia.org/P52126 and previous config saved to /var/cache/conftool/dbconfig/20230831-053035-root.json [05:31:07] (03CR) 10Marostegui: [C: 03+2] Revert "db1201: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/953651 (owner: 10Marostegui) [05:32:57] (03CR) 10Marostegui: [C: 03+2] wmnet: Update s6-master alias [dns] - 10https://gerrit.wikimedia.org/r/953488 (https://phabricator.wikimedia.org/T345223) (owner: 10Gerrit maintenance bot) [05:33:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1201 (re)pooling @ 75%: Repooling after maintenance ', diff saved to https://phabricator.wikimedia.org/P52127 and previous config saved to /var/cache/conftool/dbconfig/20230831-053300-root.json [05:34:45] (03PS1) 10Marostegui: db1131: Upgrade to 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/953753 [05:35:22] (03CR) 10Marostegui: [C: 03+2] db1131: Upgrade to 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/953753 (owner: 10Marostegui) [05:37:46] (Not accepting/receiving prefixes from anycast BGP peer) firing: (2) Alert for device cr1-codfw.wikimedia.org - Not accepting/receiving prefixes from anycast BGP peer - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer [05:43:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1182 T344309', diff saved to https://phabricator.wikimedia.org/P52128 and previous config saved to /var/cache/conftool/dbconfig/20230831-054305-root.json [05:43:13] T344309: Compile and package MariaDB 11.0.3 10.6.15, 10.4.31 - https://phabricator.wikimedia.org/T344309 [05:43:15] !log 
marostegui@cumin1001 dbctl commit (dc=all): 'db1131 (re)pooling @ 1%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P52129 and previous config saved to /var/cache/conftool/dbconfig/20230831-054314-root.json [05:45:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1182 (re)pooling @ 1%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P52130 and previous config saved to /var/cache/conftool/dbconfig/20230831-054542-root.json [05:48:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1201 (re)pooling @ 100%: Repooling after maintenance ', diff saved to https://phabricator.wikimedia.org/P52131 and previous config saved to /var/cache/conftool/dbconfig/20230831-054805-root.json [05:58:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1131 (re)pooling @ 3%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P52132 and previous config saved to /var/cache/conftool/dbconfig/20230831-055819-root.json [06:00:05] Deploy window MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230831T0600) [06:00:05] kormat, marostegui, and Amir1: #bothumor I ♥ Unicode. All rise for Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230831T0600). 
[06:00:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1182 (re)pooling @ 3%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P52133 and previous config saved to /var/cache/conftool/dbconfig/20230831-060047-root.json [06:13:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1131 (re)pooling @ 5%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P52134 and previous config saved to /var/cache/conftool/dbconfig/20230831-061324-root.json [06:15:44] (03CR) 10Volans: [C: 04-1] "LGTM, just one small missing bit and a couple of suggestions inline" [puppet] - 10https://gerrit.wikimedia.org/r/953674 (https://phabricator.wikimedia.org/T336485) (owner: 10Cathal Mooney) [06:15:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1182 (re)pooling @ 5%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P52135 and previous config saved to /var/cache/conftool/dbconfig/20230831-061551-root.json [06:22:37] (03PS1) 10KartikMistry: Enable Section and Content Translation in 7 Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953756 (https://phabricator.wikimedia.org/T343211) [06:28:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1131 (re)pooling @ 10%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P52136 and previous config saved to /var/cache/conftool/dbconfig/20230831-062829-root.json [06:30:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1182 (re)pooling @ 10%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P52137 and previous config saved to /var/cache/conftool/dbconfig/20230831-063056-root.json [06:33:57] (JobUnavailable) firing: Reduced availability for job nutcracker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:41:07] (03CR) 10Filippo 
Giunchedi: mesh: add tracing support (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/953576 (https://phabricator.wikimedia.org/T320563) (owner: 10Filippo Giunchedi) [06:42:58] (03PS4) 10Filippo Giunchedi: mesh: add tracing support [deployment-charts] - 10https://gerrit.wikimedia.org/r/953576 (https://phabricator.wikimedia.org/T320563) [06:43:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1131 (re)pooling @ 25%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P52138 and previous config saved to /var/cache/conftool/dbconfig/20230831-064333-root.json [06:46:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1182 (re)pooling @ 25%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P52139 and previous config saved to /var/cache/conftool/dbconfig/20230831-064601-root.json [06:48:05] (03PS5) 10Filippo Giunchedi: mesh: add tracing support [deployment-charts] - 10https://gerrit.wikimedia.org/r/953576 (https://phabricator.wikimedia.org/T320563) [06:48:11] (03CR) 10Filippo Giunchedi: mesh: add tracing support (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/953576 (https://phabricator.wikimedia.org/T320563) (owner: 10Filippo Giunchedi) [06:53:50] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host idp1002.wikimedia.org [06:57:49] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host idp1002.wikimedia.org [06:58:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1131 (re)pooling @ 50%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P52140 and previous config saved to /var/cache/conftool/dbconfig/20230831-065838-root.json [07:00:05] Amir1, apergos, and jnuche: May I have your attention please! UTC morning backport and config training. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230831T0700) [07:00:05] thed and kart_: A patch you scheduled for UTC morning backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:15] * kart_ is here [07:00:40] morning! we have no trainees signed up today but two patches to go. kart_ I assume you are self-deploy? I don't know where thedj is, not in channel at the moment. so kart_ you'll go first if that's ok. [07:00:47] kart_: you can self serve? [07:00:57] Amir1: yeah [07:01:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1182 (re)pooling @ 50%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P52141 and previous config saved to /var/cache/conftool/dbconfig/20230831-070105-root.json [07:01:40] Have fun! [07:01:43] (SystemdUnitFailed) firing: elasticsearch-disable-readahead.service Failed on elastic2038:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:02:19] o/ :] [07:02:34] apergos are the trainings always on Thursday? [07:02:39] yes they are [07:02:49] it's a fixed slot, see the deployment calendar :-) [07:03:41] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kartik@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953216 (owner: 10Abijeet Patro) [07:03:41] great [07:03:57] there's a workboard to request a training, if you know someone interested [07:03:59] Tyler asked me to participate so I will join next week session 8) [07:04:21] (03Merged) 10jenkins-bot: Enable MinT translation service for testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953216 (owner: 10Abijeet Patro) [07:04:27] oh! you're interested :-) well yes, sounds great. 
make it official by making a request on that phab board if you like [07:04:41] https://phabricator.wikimedia.org/project/board/5265/ [07:05:25] !log kartik@deploy1002 Started scap: Backport for [[gerrit:953216|Enable MinT translation service for testwiki]] [07:05:45] ah there we go, I was wondering what was happening :-) [07:07:00] !log kartik@deploy1002 abi and kartik: Backport for [[gerrit:953216|Enable MinT translation service for testwiki]] synced to the testservers mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [07:09:50] !log kartik@deploy1002 abi and kartik: Continuing with sync [07:12:48] (03CR) 10Muehlenhoff: [C: 03+2] Revert "confd: Make confd_prometheus_metrics.py 3.4-compatible" [puppet] - 10https://gerrit.wikimedia.org/r/953238 (owner: 10Muehlenhoff) [07:13:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1131 (re)pooling @ 75%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P52142 and previous config saved to /var/cache/conftool/dbconfig/20230831-071343-root.json [07:15:44] !log kartik@deploy1002 Finished scap: Backport for [[gerrit:953216|Enable MinT translation service for testwiki]] (duration: 10m 18s) [07:16:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1182 (re)pooling @ 75%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P52143 and previous config saved to /var/cache/conftool/dbconfig/20230831-071610-root.json [07:19:10] apergos: I'm done with config change deployment. [07:19:19] great! 
[07:19:31] still no thedj unfortunately [07:19:40] :/ [07:20:09] if anyone has another way to reach them, I'll remain here with the window open for another 15 minutes or so [07:20:58] (03PS10) 10Slyngshede: Disable user creation on wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/952209 (https://phabricator.wikimedia.org/T345226) (owner: 10Andrew Bogott) [07:21:38] (03PS2) 10Muehlenhoff: Openstack: remove support for Stretch [puppet] - 10https://gerrit.wikimedia.org/r/952848 [07:23:44] (03PS3) 10Muehlenhoff: Openstack: remove support for Stretch [puppet] - 10https://gerrit.wikimedia.org/r/952848 [07:24:17] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 82, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:25:04] (03CR) 10Muehlenhoff: Openstack: remove support for Stretch (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/952848 (owner: 10Muehlenhoff) [07:25:53] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43070/console" [puppet] - 10https://gerrit.wikimedia.org/r/951580 (https://phabricator.wikimedia.org/T337570) (owner: 10Dduvall) [07:27:47] (03CR) 10Muehlenhoff: [C: 03+2] local_dev: Update image [puppet] - 10https://gerrit.wikimedia.org/r/953205 (owner: 10Muehlenhoff) [07:28:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1131 (re)pooling @ 100%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P52144 and previous config saved to /var/cache/conftool/dbconfig/20230831-072848-root.json [07:29:28] (03CR) 10Muehlenhoff: Stop building stretch images and update monitoring for the docker registry (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/953201 (owner: 10Muehlenhoff) [07:29:30] (03CR) 10Muehlenhoff: [C: 03+2] Stop building stretch images and update monitoring for the docker registry 
[puppet] - 10https://gerrit.wikimedia.org/r/953201 (owner: 10Muehlenhoff) [07:30:14] (03CR) 10Slyngshede: [C: 03+1] "Look good, links point to the right locations." [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/953673 (https://phabricator.wikimedia.org/T338008) (owner: 10Muehlenhoff) [07:31:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1182 (re)pooling @ 100%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P52145 and previous config saved to /var/cache/conftool/dbconfig/20230831-073115-root.json [07:31:29] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] Update links to create an account and password reset to point to Bitu [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/953673 (https://phabricator.wikimedia.org/T338008) (owner: 10Muehlenhoff) [07:32:44] (03PS3) 10Slyngshede: C:idm:deployment link to runbook. [puppet] - 10https://gerrit.wikimedia.org/r/931879 (https://phabricator.wikimedia.org/T338008) [07:33:03] (03PS4) 10Slyngshede: C:idm:deployment link to runbook. [puppet] - 10https://gerrit.wikimedia.org/r/931879 (https://phabricator.wikimedia.org/T338008) [07:35:06] (03CR) 10CI reject: [V: 04-1] C:idm:deployment link to runbook. 
[puppet] - 10https://gerrit.wikimedia.org/r/931879 (https://phabricator.wikimedia.org/T338008) (owner: 10Slyngshede) [07:37:05] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1131.eqiad.wmnet with reason: Maintenance [07:37:07] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1131.eqiad.wmnet with reason: Maintenance [07:37:08] it looks like something came up for thedj, so hopefully they will reschedule, I'll close the window for today [07:37:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1131 (T343718)', diff saved to https://phabricator.wikimedia.org/P52146 and previous config saved to /var/cache/conftool/dbconfig/20230831-073713-ladsgroup.json [07:37:19] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [07:37:34] !log UTC morning backport and config window done [07:37:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:37:48] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2097.codfw.wmnet with reason: Maintenance [07:38:01] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2097.codfw.wmnet with reason: Maintenance [07:39:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131 (T343718)', diff saved to https://phabricator.wikimedia.org/P52147 and previous config saved to /var/cache/conftool/dbconfig/20230831-073921-ladsgroup.json [07:40:22] (03PS4) 10Jelto: gitlab: Support loading of local gems [puppet] - 10https://gerrit.wikimedia.org/r/951580 (https://phabricator.wikimedia.org/T337570) (owner: 10Dduvall) [07:44:08] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43071/console" [puppet] - 10https://gerrit.wikimedia.org/r/951580 (https://phabricator.wikimedia.org/T337570) (owner: 
10Dduvall) [07:44:19] (03PS1) 10Muehlenhoff: Fix comments [puppet] - 10https://gerrit.wikimedia.org/r/953959 [07:44:59] (03PS2) 10Muehlenhoff: Fix comments [puppet] - 10https://gerrit.wikimedia.org/r/953959 [07:48:49] 10SRE, 10Infrastructure-Foundations, 10netops: Juniper ZTP fails on certain devices due to DHCP binding on management router - https://phabricator.wikimedia.org/T345273 (10ayounsi) FYI there is now a pending diff for: ` [edit forwarding-options dhcp-relay] + /* T337345 */ + forward-snooped-clients non-... [07:49:33] (03PS5) 10Slyngshede: C:idm:deployment link to runbook. [puppet] - 10https://gerrit.wikimedia.org/r/931879 (https://phabricator.wikimedia.org/T338008) [07:50:49] (03PS1) 10Muehlenhoff: networking fact: Remove check for stretch [puppet] - 10https://gerrit.wikimedia.org/r/953960 [07:50:54] (03CR) 10Muehlenhoff: [C: 03+2] Fix comments [puppet] - 10https://gerrit.wikimedia.org/r/953959 (owner: 10Muehlenhoff) [07:51:08] !log jayme@cumin1001 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:wikikube-worker-eqiad [07:51:26] (03CR) 10CI reject: [V: 04-1] networking fact: Remove check for stretch [puppet] - 10https://gerrit.wikimedia.org/r/953960 (owner: 10Muehlenhoff) [07:52:48] (03PS2) 10Muehlenhoff: networking fact: Remove check for stretch [puppet] - 10https://gerrit.wikimedia.org/r/953960 [07:52:54] !log ariel@cumin1001 START - Cookbook sre.hosts.reboot-single for host snapshot1009.eqiad.wmnet [07:54:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P52148 and previous config saved to /var/cache/conftool/dbconfig/20230831-075428-ladsgroup.json [07:56:50] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2117.codfw.wmnet with reason: Maintenance [07:57:03] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2117.codfw.wmnet with reason: Maintenance [07:57:09] !log 
ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2117 (T343718)', diff saved to https://phabricator.wikimedia.org/P52149 and previous config saved to /var/cache/conftool/dbconfig/20230831-075709-ladsgroup.json [07:57:15] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [07:58:39] (03CR) 10Ladsgroup: [C: 03+1] "Looks good to me." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/952209 (https://phabricator.wikimedia.org/T345226) (owner: 10Andrew Bogott) [08:00:56] jouncebot: nowandnext [08:00:56] No deployments scheduled for the next 1 hour(s) and 59 minute(s) [08:00:56] In 1 hour(s) and 59 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230831T1000) [08:00:56] In 1 hour(s) and 59 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230831T1000) [08:03:30] slyngs: we coordinate here [08:03:40] (03CR) 10Ladsgroup: [C: 03+2] Disable user creation on wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/952209 (https://phabricator.wikimedia.org/T345226) (owner: 10Andrew Bogott) [08:03:54] !log ariel@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host snapshot1009.eqiad.wmnet [08:03:58] !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-cluster [08:04:08] slyngs: do you know about https://wikitech.wikimedia.org/wiki/WikimediaDebug#Browser_usage [08:04:14] (03CR) 10Vgutierrez: [C: 03+2] Fix cache_upload timeouts in single-backend sites [puppet] - 10https://gerrit.wikimedia.org/r/953700 (https://phabricator.wikimedia.org/T288106) (owner: 10BBlack) [08:04:24] (03Merged) 10jenkins-bot: Disable user creation on wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/952209 (https://phabricator.wikimedia.org/T345226) (owner: 10Andrew Bogott) [08:04:32] Amir1: I do not [08:04:53] you need to install a browser extension to test it when the patch arrives to mwdebug 
hosts [08:04:56] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:952209|Disable user creation on wikitech (T345226)]] [08:05:01] T345226: Switch developer account creation to Bitu - https://phabricator.wikimedia.org/T345226 [08:05:29] Amir1: wikitech is not compatible with that [08:05:42] ah, yeah, okay [08:05:57] I keep forgetting it's a special snowflake [08:06:13] slyngs: scratch that, it doesn't work with that [08:06:14] Hopefully removing the signup will push us towards it not being special [08:06:28] (03PS4) 10Vgutierrez: trafficserver: Set active timeouts to 1h in upload [puppet] - 10https://gerrit.wikimedia.org/r/953638 (https://phabricator.wikimedia.org/T341755) [08:06:31] yep, it's one step closer. still many to go though :/ [08:06:33] !log ladsgroup@deploy1002 ladsgroup and andrew: Backport for [[gerrit:952209|Disable user creation on wikitech (T345226)]] synced to the testservers mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [08:06:42] !log ladsgroup@deploy1002 ladsgroup and andrew: Continuing with sync [08:06:58] what's the next blocker? [08:07:16] also when are we going to remove labtestwiki? [08:07:47] !log ariel@cumin1001 START - Cookbook sre.hosts.reboot-single for host snapshot1010.eqiad.wmnet [08:08:17] very good question. we'd need a replacement for its 2FA functionality I think, not sure if there are plans for idp/idm instances against the codfw1dev ldap cluster [08:08:38] slyngs: the design :( [08:08:43] and the next blocker is SSH key management in IDM, that would let us undeploy OSM [08:08:57] Open Street Map?
[08:09:05] openstackmanager [08:09:12] ah, that makes more sense [08:09:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P52150 and previous config saved to /var/cache/conftool/dbconfig/20230831-080934-ladsgroup.json [08:09:44] (03CR) 10Vgutierrez: [C: 03+2] trafficserver: Set active timeouts to 1h in upload [puppet] - 10https://gerrit.wikimedia.org/r/953638 (https://phabricator.wikimedia.org/T341755) (owner: 10Vgutierrez) [08:10:54] We're already working on the feature to remove openstackmanager [08:11:20] Not much is left really, mostly SSH key management, which has been implemented but not enabled [08:11:50] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/931879 (https://phabricator.wikimedia.org/T338008) (owner: 10Slyngshede) [08:12:05] I'll create a new developer account and check if it's able to login to wikitech [08:12:34] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:12:51] (03PS1) 10Ayounsi: gNMI: remove from cloudsw, add to cr [homer/public] - 10https://gerrit.wikimedia.org/r/953961 (https://phabricator.wikimedia.org/T326322) [08:13:02] is there a log of new accounts created via idm? [08:13:30] if you manage to allow wikitech be integrated with the rest of the fleet, I'll buy you eight beers in the next in-person, one of each incident we had because of wikitech being a special snowflake [08:13:47] one *for [08:13:50] taavi: Yes, right now moritzm and I are getting an email on all signups. 
It's also logged on the server in the application log [08:14:18] maybe send it to logstash too [08:14:20] !log ariel@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host snapshot1010.eqiad.wmnet [08:14:27] (03CR) 10Ayounsi: [C: 03+2] gNMI: remove from cloudsw, add to cr [homer/public] - 10https://gerrit.wikimedia.org/r/953961 (https://phabricator.wikimedia.org/T326322) (owner: 10Ayounsi) [08:14:31] but we probably need something more performant [08:14:38] *permanent [08:15:00] (03Merged) 10jenkins-bot: gNMI: remove from cloudsw, add to cr [homer/public] - 10https://gerrit.wikimedia.org/r/953961 (https://phabricator.wikimedia.org/T326322) (owner: 10Ayounsi) [08:15:03] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:952209|Disable user creation on wikitech (T345226)]] (duration: 10m 06s) [08:15:09] slyngs: done ^ [08:15:11] T345226: Switch developer account creation to Bitu - https://phabricator.wikimedia.org/T345226 [08:15:14] !log ariel@cumin1001 START - Cookbook sre.hosts.reboot-single for host snapshot1011.eqiad.wmnet [08:15:56] it couldn't deploy it to snapshot1010.eqiad.wmnet, mw2287.codfw.wmnet and mw2285.codfw.wmnet [08:16:27] (03PS1) 10Ilias Sarantopoulos: ores-extension: fix arwiki likelybad threshold [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953962 (https://phabricator.wikimedia.org/T345305) [08:16:43] It worked.... AMAZING...
I created a new account, promptly forgot the username and failed to log in, then recovered the username and now I can log in [08:17:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117 (T343718)', diff saved to https://phabricator.wikimedia.org/P52151 and previous config saved to /var/cache/conftool/dbconfig/20230831-081705-ladsgroup.json [08:17:11] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [08:17:34] (KubernetesAPILatency) resolved: (3) High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:17:36] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] varnish: Increase send_timeout in upload [puppet] - 10https://gerrit.wikimedia.org/r/953678 (https://phabricator.wikimedia.org/T341755) (owner: 10Vgutierrez) [08:17:44] Amir1: If possible I'd like the beers spread out over time, I can sleep after more than two beers, to horrors of growing old [08:17:47] the [08:17:56] haha, sure :D [08:18:05] slyngs: the gerrit sign up link needs updating it seems [08:18:12] I also just fixed a bunch of wikitech pages [08:18:41] we probably need to send an email to wikitech and possibly a message in engineering-all [08:18:46] yes please [08:18:49] I'll ping hashar about gerrit, moritzm is fixing idp... And thank you for updating the wikitech pages. [08:19:08] Amir1: do you know if we can customize the error message on https://wikitech.wikimedia.org/wiki/Special:CreateAccount without affecting other pages?
[08:19:16] yeah, the updated CAS package with links pointing to Bitu will go out later in the day [08:19:19] Right, I'll do that now, and include a link to the patch, in case we need to revert [08:19:34] !log ayounsi@cumin1001 START - Cookbook sre.network.tls for network device cr3-ulsfo [08:19:56] !log ariel@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host snapshot1011.eqiad.wmnet [08:20:20] taavi: would this help? https://wikitech.wikimedia.org/wiki/Special:CreateAccount/?uselang=qqx [08:20:42] Who to email, just sre@... we need to hit the developers as well [08:20:52] wikitech-l? [08:20:59] yeah, wikitech-l [08:21:01] Yes, just remembered that :-) [08:21:11] Amir1: I guess I can change MediaWiki:permissionserrorstext-withaction, but I think that would affect all of the special pages and not just that specific one [08:21:17] unless I can use a magic word to vary the message? [08:21:27] !log set send_timeout to 3620s in the upload cluster via cumin to avoid a varnish restart https://gerrit.wikimedia.org/r/c/operations/puppet/+/953678 - T341755 [08:21:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:21:32] we could probably but let's not [08:21:32] T341755: Cannot download large (2GB) files with 10Mbps or slower network due to ATS timeout - https://phabricator.wikimedia.org/T341755 [08:23:22] !log elukey@cumin1001 START - Cookbook sre.kafka.reboot-workers for Kafka main-eqiad cluster: Reboot kafka nodes [08:23:28] !log ariel@cumin1001 START - Cookbook sre.hosts.reboot-single for host snapshot1012.eqiad.wmnet [08:24:12] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cr3-ulsfo [08:24:15] !log ayounsi@cumin1001 START - Cookbook sre.network.tls for network device cr4-ulsfo [08:24:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131 (T343718)', diff saved to https://phabricator.wikimedia.org/P52152 and previous config saved to 
/var/cache/conftool/dbconfig/20230831-082440-ladsgroup.json [08:24:43] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1165.eqiad.wmnet with reason: Maintenance [08:24:44] 10SRE, 10Infrastructure-Foundations, 10netbox, 10netops: Netbox Juniper report - https://phabricator.wikimedia.org/T306238 (10ayounsi) @jbond from Juniper, does it make sens? > “If the customer would like to use OIDC they enter in their token for us to use and authenticate. The vast majority of users sign... [08:24:46] slyngs: hi, please file whatever request in Phabricator against #gerrit :-) [08:24:46] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [08:24:56] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1165.eqiad.wmnet with reason: Maintenance [08:24:58] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [08:25:02] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [08:25:02] hashar: Will do, thank you [08:25:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1165 (T343718)', diff saved to https://phabricator.wikimedia.org/P52153 and previous config saved to /var/cache/conftool/dbconfig/20230831-082508-ladsgroup.json [08:25:30] RECOVERY - cassandra-b service on restbase1030 is OK: OK - cassandra-b is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [08:25:43] the sign up link is https://wikitech.wikimedia.org/w/index.php?title=Special:CreateAccount&returnto=Gerrit/NewUser and it is defined somewhere in operations/puppet under modules/gerrit [08:26:16] and that URL is probably used in various on wikis documentation ( mw:Git and subpages come to mind ) [08:26:47] Ah, okay, I can do 
the patch and Phabricator task then [08:27:17] and potentially we could get Gerrit to migrate to OAUTH / SAML instead of talking to LDAP directly, but that is a side track :-) [08:27:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T343718)', diff saved to https://phabricator.wikimedia.org/P52154 and previous config saved to /var/cache/conftool/dbconfig/20230831-082717-ladsgroup.json [08:27:33] (03CR) 10Ladsgroup: [C: 03+2] ores-extension: fix arwiki likelybad threshold [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953962 (https://phabricator.wikimedia.org/T345305) (owner: 10Ilias Sarantopoulos) [08:27:46] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953962 (https://phabricator.wikimedia.org/T345305) (owner: 10Ilias Sarantopoulos) [08:28:11] (03Merged) 10jenkins-bot: ores-extension: fix arwiki likelybad threshold [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953962 (https://phabricator.wikimedia.org/T345305) (owner: 10Ilias Sarantopoulos) [08:28:37] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:953962|ores-extension: fix arwiki likelybad threshold (T345305)]] [08:28:42] T345305: MWException: Default '"soft"' is invalid for preference oresDamagingPref of most users - https://phabricator.wikimedia.org/T345305 [08:28:53] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cr4-ulsfo [08:28:55] !log ayounsi@cumin1001 START - Cookbook sre.network.tls for network device cr1-eqiad [08:28:56] (03CR) 10David Caro: replica_cnf_api: add envvars backend (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/936232 (https://phabricator.wikimedia.org/T265691) (owner: 10David Caro) [08:30:02] !log ariel@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host snapshot1012.eqiad.wmnet [08:30:14] !log ariel@cumin1001 START - Cookbook 
sre.hosts.reboot-single for host snapshot1013.eqiad.wmnet [08:32:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117', diff saved to https://phabricator.wikimedia.org/P52155 and previous config saved to /var/cache/conftool/dbconfig/20230831-083211-ladsgroup.json [08:33:33] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cr1-eqiad [08:33:35] !log ayounsi@cumin1001 START - Cookbook sre.network.tls for network device cr1-esams [08:33:36] 08:31:33 /usr/bin/sudo /usr/local/sbin/mediawiki-image-download 2023-08-31-082844-publish (ran as mwdeploy@kubernetes1008.eqiad.wmnet) returned [255]: ssh: connect to host kubernetes1008.eqiad.wmnet port 22: Connection timed out [08:36:12] !log ladsgroup@deploy1002 ladsgroup and isaranto: Backport for [[gerrit:953962|ores-extension: fix arwiki likelybad threshold (T345305)]] synced to the testservers mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [08:36:18] T345305: MWException: Default '"soft"' is invalid for preference oresDamagingPref of most users - https://phabricator.wikimedia.org/T345305 [08:36:27] !log ariel@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host snapshot1013.eqiad.wmnet [08:36:54] !log ladsgroup@deploy1002 ladsgroup and isaranto: Continuing with sync [08:37:04] confirmed it fixes the issue [08:38:21] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cr1-esams [08:38:24] !log ayounsi@cumin1001 START - Cookbook sre.network.tls for network device cr2-codfw [08:38:37] !log cmooney@cumin1001 START - Cookbook sre.network.provision for device ssw1-a1-codfw.mgmt.codfw.wmnet [08:38:42] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [08:39:20] !log ariel@cumin1001 START - Cookbook sre.hosts.reboot-single for host 
snapshot1014.eqiad.wmnet [08:40:16] (03CR) 10Elukey: [C: 03+2] ml-services: tune knative's container concurrency settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/953578 (https://phabricator.wikimedia.org/T344058) (owner: 10Elukey) [08:40:39] (03CR) 10Jelto: [V: 03+1 C: 03+2] gitlab: Support loading of local gems [puppet] - 10https://gerrit.wikimedia.org/r/951580 (https://phabricator.wikimedia.org/T337570) (owner: 10Dduvall) [08:40:48] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for ssw1-a1-codfw - cmooney@cumin1001" [08:40:53] (03PS1) 10Ayounsi: Enable GNMI on cloudsw [homer/public] - 10https://gerrit.wikimedia.org/r/953963 (https://phabricator.wikimedia.org/T316544) [08:41:55] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for ssw1-a1-codfw - cmooney@cumin1001" [08:41:55] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:42:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P52156 and previous config saved to /var/cache/conftool/dbconfig/20230831-084224-ladsgroup.json [08:42:31] (03PS1) 10Ayounsi: gNMI: collect data from core routers [puppet] - 10https://gerrit.wikimedia.org/r/953964 (https://phabricator.wikimedia.org/T326322) [08:42:58] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cr2-codfw [08:43:01] !log ayounsi@cumin1001 START - Cookbook sre.network.tls for network device cr2-drmrs [08:44:51] (03PS5) 10Cathal Mooney: Modify Juniper ZTP script used during initial provision [puppet] - 10https://gerrit.wikimedia.org/r/953674 (https://phabricator.wikimedia.org/T336485) [08:45:54] (03PS2) 10Ayounsi: gNMI: collect data from core routers [puppet] - 
10https://gerrit.wikimedia.org/r/953964 (https://phabricator.wikimedia.org/T326322) [08:46:37] (03PS6) 10Cathal Mooney: Modify Juniper ZTP script used during initial provision [puppet] - 10https://gerrit.wikimedia.org/r/953674 (https://phabricator.wikimedia.org/T336485) [08:47:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117', diff saved to https://phabricator.wikimedia.org/P52157 and previous config saved to /var/cache/conftool/dbconfig/20230831-084717-ladsgroup.json [08:47:20] Amir1: The connection timed out probably because we're rebooting the k8s server right now. That's the pre-pull of the image on all k8s hosts failing [08:47:40] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cr2-drmrs [08:47:42] noted [08:47:42] !log ayounsi@cumin1001 START - Cookbook sre.network.tls for network device cr2-eqdfw [08:47:50] (03PS36) 10David Caro: replica_cnf_api: add envvars backend [puppet] - 10https://gerrit.wikimedia.org/r/936232 (https://phabricator.wikimedia.org/T265691) [08:48:02] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/953964 (https://phabricator.wikimedia.org/T326322) (owner: 10Ayounsi) [08:48:38] (03PS7) 10Volans: Modify Juniper ZTP script used during initial provision [puppet] - 10https://gerrit.wikimedia.org/r/953674 (https://phabricator.wikimedia.org/T336485) (owner: 10Cathal Mooney) [08:48:58] (03CR) 10Slyngshede: [C: 03+2] C:idm:deployment link to runbook. [puppet] - 10https://gerrit.wikimedia.org/r/931879 (https://phabricator.wikimedia.org/T338008) (owner: 10Slyngshede) [08:49:12] (03CR) 10Volans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/953674 (https://phabricator.wikimedia.org/T336485) (owner: 10Cathal Mooney) [08:50:36] !log ariel@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host snapshot1014.eqiad.wmnet [08:50:41] (03CR) 10JMeybohm: "Nice! 
But I think this it not how sextant works currently. AIUI it considers minor version changes incompatible/not backwards compatible a" [deployment-charts] - 10https://gerrit.wikimedia.org/r/953553 (owner: 10Alexandros Kosiaris) [08:51:11] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [08:51:33] (03CR) 10CI reject: [V: 04-1] replica_cnf_api: add envvars backend [puppet] - 10https://gerrit.wikimedia.org/r/936232 (https://phabricator.wikimedia.org/T265691) (owner: 10David Caro) [08:51:43] (03CR) 10Filippo Giunchedi: [C: 03+1] "Nice!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/953675 (https://phabricator.wikimedia.org/T344253) (owner: 10JMeybohm) [08:51:57] (03CR) 10Cathal Mooney: Modify Juniper ZTP script used during initial provision (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/953674 (https://phabricator.wikimedia.org/T336485) (owner: 10Cathal Mooney) [08:52:17] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cr2-eqdfw [08:52:19] !log ayounsi@cumin1001 START - Cookbook sre.network.tls for network device cr2-eqiad [08:52:28] (03CR) 10Muehlenhoff: firewall: move conntrack logic to firewall module (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/953276 (https://phabricator.wikimedia.org/T336497) (owner: 10Jbond) [08:52:30] (03PS37) 10David Caro: replica_cnf_api: add envvars backend [puppet] - 10https://gerrit.wikimedia.org/r/936232 (https://phabricator.wikimedia.org/T265691) [08:52:35] (03CR) 10Volans: "PCC fails with:" [puppet] - 10https://gerrit.wikimedia.org/r/953674 (https://phabricator.wikimedia.org/T336485) (owner: 10Cathal Mooney) [08:52:37] !log ariel@cumin1001 START - Cookbook sre.hosts.reboot-single for host snapshot1015.eqiad.wmnet [08:54:58] (03CR) 10Cathal Mooney: Modify Juniper ZTP script used during initial provision (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/953674 
(https://phabricator.wikimedia.org/T336485) (owner: 10Cathal Mooney) [08:56:02] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [08:56:15] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [08:56:52] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cr2-eqiad [08:56:55] !log ayounsi@cumin1001 START - Cookbook sre.network.tls for network device cr2-eqord [08:57:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P52158 and previous config saved to /var/cache/conftool/dbconfig/20230831-085731-ladsgroup.json [08:57:33] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf,ops for arnaudb - https://phabricator.wikimedia.org/T345241 (10jcrespo) 05Open→03Resolved [08:59:12] (03CR) 10JMeybohm: [C: 03+2] jaeger: Configure ingress using istio CRD [deployment-charts] - 10https://gerrit.wikimedia.org/r/953675 (https://phabricator.wikimedia.org/T344253) (owner: 10JMeybohm) [09:00:58] 10SRE, 10Infrastructure-Foundations, 10Puppet-Core: Puppet certificate missing subjectAltName - https://phabricator.wikimedia.org/T158757 (10jbond) >>! In T158757#9132594, @nshahquinn-wmf wrote: > FYI, Urllib3 version 2, released in April 2023, [removed the fallback from serverAltName to commonName](https://... 
[09:01:28] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cr2-eqord [09:01:30] !log ayounsi@cumin1001 START - Cookbook sre.network.tls for network device cr2-eqsin [09:01:41] (03Merged) 10jenkins-bot: jaeger: Configure ingress using istio CRD [deployment-charts] - 10https://gerrit.wikimedia.org/r/953675 (https://phabricator.wikimedia.org/T344253) (owner: 10JMeybohm) [09:02:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117 (T343718)', diff saved to https://phabricator.wikimedia.org/P52159 and previous config saved to /var/cache/conftool/dbconfig/20230831-090223-ladsgroup.json [09:02:26] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2124.codfw.wmnet with reason: Maintenance [09:02:31] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [09:02:39] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2124.codfw.wmnet with reason: Maintenance [09:02:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2124 (T343718)', diff saved to https://phabricator.wikimedia.org/P52160 and previous config saved to /var/cache/conftool/dbconfig/20230831-090244-ladsgroup.json [09:03:55] !log ariel@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host snapshot1015.eqiad.wmnet [09:06:37] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cr2-eqsin [09:06:39] !log ayounsi@cumin1001 START - Cookbook sre.network.tls for network device cr2-esams [09:11:19] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cr2-esams [09:11:23] !log ayounsi@cumin1001 START - Cookbook sre.network.tls for network device cr3-eqsin [09:11:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - 
https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:12:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T343718)', diff saved to https://phabricator.wikimedia.org/P52161 and previous config saved to /var/cache/conftool/dbconfig/20230831-091237-ladsgroup.json [09:12:39] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1168.eqiad.wmnet with reason: Maintenance [09:12:43] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [09:12:48] !log ariel@cumin1001 START - Cookbook sre.hosts.reboot-single for host snapshot1016.eqiad.wmnet [09:12:53] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1168.eqiad.wmnet with reason: Maintenance [09:12:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1168 (T343718)', diff saved to https://phabricator.wikimedia.org/P52162 and previous config saved to /var/cache/conftool/dbconfig/20230831-091258-ladsgroup.json [09:13:11] 10SRE, 10Bitu, 10Infrastructure-Foundations: Switch developer account creation to Bitu - https://phabricator.wikimedia.org/T345226 (10SLyngshede-WMF) [09:14:17] (03CR) 10JMeybohm: [C: 03+2] Depend mesh.configuration:1.4 on mesh.deployment:1.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/953630 (owner: 10Alexandros Kosiaris) [09:14:57] (03Merged) 10jenkins-bot: Depend mesh.configuration:1.4 on mesh.deployment:1.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/953630 (owner: 10Alexandros Kosiaris) [09:15:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T343718)', diff saved to https://phabricator.wikimedia.org/P52163 and previous config saved to /var/cache/conftool/dbconfig/20230831-091507-ladsgroup.json [09:15:10] (03CR) 10JMeybohm: [C: 03+1] service: add media-analytics service entry [puppet] - 
10https://gerrit.wikimedia.org/r/951901 (https://phabricator.wikimedia.org/T336380) (owner: 10Hnowlan) [09:15:45] (03CR) 10JMeybohm: [C: 03+1] cassandra-http-gateway: use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/953666 (https://phabricator.wikimedia.org/T300033) (owner: 10Hnowlan) [09:16:28] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cr3-eqsin [09:16:30] PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:16:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:17:40] (03PS1) 10Slyngshede: C:gerrit Link account creation to IDM. 
[puppet] - 10https://gerrit.wikimedia.org/r/953967 (https://phabricator.wikimedia.org/T345226) [09:18:45] (03CR) 10Jbond: "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/953960 (owner: 10Muehlenhoff) [09:21:58] (03PS1) 10Jelto: gitlab: enable local_gems in devtools test instance [puppet] - 10https://gerrit.wikimedia.org/r/953968 (https://phabricator.wikimedia.org/T337570) [09:22:22] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:22:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124 (T343718)', diff saved to https://phabricator.wikimedia.org/P52164 and previous config saved to /var/cache/conftool/dbconfig/20230831-092231-ladsgroup.json [09:22:38] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [09:23:38] 10SRE, 10Infrastructure-Foundations, 10netbox, 10netops: Netbox Juniper report - https://phabricator.wikimedia.org/T306238 (10jbond) >>! In T306238#9132987, @ayounsi wrote: > @jbond from Juniper, does it make sens? >> “If the customer would like to use OIDC they enter in their token for us to use and authe... 
[09:24:16] !log ariel@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host snapshot1016.eqiad.wmnet [09:24:23] (03CR) 10JMeybohm: [C: 03+1] thumbor: Update dependencies to be ready for cert manager [deployment-charts] - 10https://gerrit.wikimedia.org/r/953667 (https://phabricator.wikimedia.org/T300033) (owner: 10Hnowlan) [09:25:24] (03Abandoned) 10Jbond: firewall: move conntrack logic to firewall module [puppet] - 10https://gerrit.wikimedia.org/r/953276 (https://phabricator.wikimedia.org/T336497) (owner: 10Jbond) [09:25:35] (03Abandoned) 10Jbond: firewall: add conntrack require on the active firewall [puppet] - 10https://gerrit.wikimedia.org/r/953610 (https://phabricator.wikimedia.org/T336497) (owner: 10Jbond) [09:26:37] (03CR) 10Jelto: [C: 03+2] gitlab: enable local_gems in devtools test instance [puppet] - 10https://gerrit.wikimedia.org/r/953968 (https://phabricator.wikimedia.org/T337570) (owner: 10Jelto) [09:26:49] (RdfStreamingUpdaterFlinkJobUnstable) firing: WCQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [09:26:54] (03PS1) 10Ayounsi: Prometheus: scrape gNMIc endpoint [puppet] - 10https://gerrit.wikimedia.org/r/953969 (https://phabricator.wikimedia.org/T326322) [09:27:21] (03PS1) 10Cathal Mooney: Move Juniper temp ztp password from installserver to apt_repo [labs/private] - 10https://gerrit.wikimedia.org/r/953971 (https://phabricator.wikimedia.org/T336485) [09:27:49] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) firing: Processing latency of WCQS_Streaming_Updater in eqiad (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - 
https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [09:28:08] (03PS2) 10Ayounsi: Prometheus: scrape gNMIc endpoint [puppet] - 10https://gerrit.wikimedia.org/r/953969 (https://phabricator.wikimedia.org/T326322) [09:28:16] (03PS13) 10Jbond: ferm: add ensure support to the ferm class [puppet] - 10https://gerrit.wikimedia.org/r/952889 (https://phabricator.wikimedia.org/T336497) [09:28:34] (03PS3) 10Ayounsi: Prometheus: scrape gNMIc endpoints [puppet] - 10https://gerrit.wikimedia.org/r/953969 (https://phabricator.wikimedia.org/T326322) [09:29:44] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/953969 (https://phabricator.wikimedia.org/T326322) (owner: 10Ayounsi) [09:30:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P52165 and previous config saved to /var/cache/conftool/dbconfig/20230831-093013-ladsgroup.json [09:30:29] (03PS14) 10Jbond: ferm: add ensure support to the ferm class [puppet] - 10https://gerrit.wikimedia.org/r/952889 (https://phabricator.wikimedia.org/T336497) [09:30:44] (03CR) 10Ayounsi: [C: 03+2] gNMI: collect data from core routers [puppet] - 10https://gerrit.wikimedia.org/r/953964 (https://phabricator.wikimedia.org/T326322) (owner: 10Ayounsi) [09:30:49] (RdfStreamingUpdaterNotEnoughTaskSlots) firing: The flink session cluster rdf-streaming-updater in eqiad (k8s) does not have enough task slots - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterNotEnoughTaskSlots [09:30:50] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:30:58] RECOVERY - Check 
systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:31:44] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-api-int_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:32:39] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 3 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43073/console" [puppet] - 10https://gerrit.wikimedia.org/r/952889 (https://phabricator.wikimedia.org/T336497) (owner: 10Jbond) [09:32:49] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) resolved: Processing latency of WCQS_Streaming_Updater in eqiad (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [09:33:10] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:33:32] !log ariel@cumin1001 START - Cookbook sre.hosts.reboot-single for host snapshot1017.eqiad.wmnet [09:35:09] !log imported cas 6.6.11+wmf11u1 to apt.wikimedia.org [09:35:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web (canary) at eqiad - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [09:35:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:49] Checking mw-web [09:35:49] (RdfStreamingUpdaterNotEnoughTaskSlots) resolved: The flink session cluster rdf-streaming-updater 
in eqiad (k8s) does not have enough task slots - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterNotEnoughTaskSlots [09:36:49] (RdfStreamingUpdaterFlinkJobUnstable) resolved: WCQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [09:37:34] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:953962|ores-extension: fix arwiki likelybad threshold (T345305)]] (duration: 68m 57s) [09:37:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124', diff saved to https://phabricator.wikimedia.org/P52166 and previous config saved to /var/cache/conftool/dbconfig/20230831-093738-ladsgroup.json [09:37:40] T345305: MWException: Default '"soft"' is invalid for preference oresDamagingPref of most users - https://phabricator.wikimedia.org/T345305 [09:37:46] (Not accepting/receiving prefixes from anycast BGP peer) firing: (2) Alert for device cr1-codfw.wikimedia.org - Not accepting/receiving prefixes from anycast BGP peer - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer [09:37:50] (03PS4) 10Ayounsi: Prometheus: scrape gNMIc endpoints [puppet] - 10https://gerrit.wikimedia.org/r/953969 (https://phabricator.wikimedia.org/T326322) [09:38:04] (03PS1) 10Slyngshede: Lowercase email addresses. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/953972 [09:38:21] (03CR) 10Filippo Giunchedi: [C: 03+1] Prometheus: scrape gNMIc endpoints [puppet] - 10https://gerrit.wikimedia.org/r/953969 (https://phabricator.wikimedia.org/T326322) (owner: 10Ayounsi) [09:38:45] 09:37:34 Finished scap: Backport for [[gerrit:953962|ores-extension: fix arwiki likelybad threshold (T345305)]] (duration: 68m 57s) [09:38:59] 10SRE, 10Infrastructure-Foundations, 10netops: Juniper ZTP fails on certain devices due to DHCP binding on management router - https://phabricator.wikimedia.org/T345273 (10cmooney) >>! In T345273#9132938, @ayounsi wrote: > FYI there is now a pending diff for: > ` > [edit forwarding-options dhcp-relay] > +... [09:40:00] (HelmReleaseBadStatus) firing: (2) Helm release mw-api-ext/main on k8s@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [09:40:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web (canary) at eqiad - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [09:40:17] Not sure why it fired [09:40:43] Amir1: Did your scap deploy fail? Do you want me to redeploy mw-api-ext? [09:40:49] (RdfStreamingUpdaterNotEnoughTaskSlots) firing: The flink session cluster rdf-streaming-updater in eqiad (k8s) does not have enough task slots - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterNotEnoughTaskSlots [09:41:00] I can re-do it if the reboots are done [09:41:07] Did you see other releases fail? 
[09:41:14] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:41:23] (03CR) 10Ayounsi: [C: 03+2] Prometheus: scrape gNMIc endpoints [puppet] - 10https://gerrit.wikimedia.org/r/953969 (https://phabricator.wikimedia.org/T326322) (owner: 10Ayounsi) [09:41:30] Amir1: checking reboot status [09:41:55] eqiad: Deployment of mw-api-int-canary failed: Command '['helmfile', '-e', 'eqiad', '--selector', 'name=canary', 'apply']' returned non-zero exit status 1. [09:42:18] (03PS1) 10Ilias Sarantopoulos: ores-extension: enable lift wing for fiwiki and itwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953973 (https://phabricator.wikimedia.org/T343308) [09:42:58] (03PS1) 10JMeybohm: jaeger: Fix networkpolicy (indentation) [deployment-charts] - 10https://gerrit.wikimedia.org/r/953974 (https://phabricator.wikimedia.org/T344253) [09:44:03] (03CR) 10Jbond: "I have abandoned this and the other change and restored https://gerrit.wikimedia.org/r/c/operations/puppet/+/952889/12" [puppet] - 10https://gerrit.wikimedia.org/r/953276 (https://phabricator.wikimedia.org/T336497) (owner: 10Jbond) [09:45:10] !log ariel@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host snapshot1017.eqiad.wmnet [09:45:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P52167 and previous config saved to /var/cache/conftool/dbconfig/20230831-094520-ladsgroup.json [09:45:34] (RdfStreamingUpdaterFlinkJobUnstable) firing: (2) WCQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [09:45:45] !log 
ariel@cumin1001 START - Cookbook sre.hosts.reboot-single for host dumpsdata1001.eqiad.wmnet [09:45:49] (RdfStreamingUpdaterNotEnoughTaskSlots) resolved: The flink session cluster rdf-streaming-updater in eqiad (k8s) does not have enough task slots - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterNotEnoughTaskSlots [09:45:50] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) firing: Processing latency of WDQS_Streaming_Updater in eqiad (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [09:45:51] (03CR) 10Jbond: Modify Juniper ZTP script used during initial provision (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/953674 (https://phabricator.wikimedia.org/T336485) (owner: 10Cathal Mooney) [09:45:56] ariel@cumin1001: Failed to log message to wiki. Somebody should check the error logs. 
[09:46:38] (03CR) 10JMeybohm: mesh: add tracing support (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/953576 (https://phabricator.wikimedia.org/T320563) (owner: 10Filippo Giunchedi) [09:46:54] (03CR) 10JMeybohm: [C: 03+2] jaeger: Fix networkpolicy (indentation) [deployment-charts] - 10https://gerrit.wikimedia.org/r/953974 (https://phabricator.wikimedia.org/T344253) (owner: 10JMeybohm) [09:47:18] (03PS1) 10Cathal Mooney: Do not add DHCP exception for unconfigured ints on L3 switches [homer/public] - 10https://gerrit.wikimedia.org/r/953975 (https://phabricator.wikimedia.org/T345273) [09:47:34] Amir1: They're still running but as long as we don't re-run the deployment code will be out of sync between bare metal and mw-on-k8s [09:47:39] (03Merged) 10jenkins-bot: jaeger: Fix networkpolicy (indentation) [deployment-charts] - 10https://gerrit.wikimedia.org/r/953974 (https://phabricator.wikimedia.org/T344253) (owner: 10JMeybohm) [09:47:42] I'll try and push it through [09:48:19] sure [09:48:24] (03CR) 10Cathal Mooney: Modify Juniper ZTP script used during initial provision (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/953674 (https://phabricator.wikimedia.org/T336485) (owner: 10Cathal Mooney) [09:49:52] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [09:49:55] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [09:50:44] !log jayme@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/aus-k8s-eqiad-services/jaeger: apply [09:50:48] !log jayme@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/aus-k8s-eqiad-services/jaeger: apply [09:50:50] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) resolved: Processing latency of WDQS_Streaming_Updater in eqiad (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - 
https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [09:50:53] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: sync [09:50:59] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: sync [09:51:11] !log ariel@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dumpsdata1001.eqiad.wmnet [09:51:29] !log ariel@cumin1001 START - Cookbook sre.hosts.reboot-single for host dumpsdata1002.eqiad.wmnet [09:51:42] (03PS1) 10Ayounsi: Prometheus: gnmi re-label fix [puppet] - 10https://gerrit.wikimedia.org/r/953976 [09:51:42] !log jayme@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/aus-k8s-eqiad-services/jaeger: apply [09:51:59] !log jayme@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/aus-k8s-eqiad-services/jaeger: apply [09:52:24] PROBLEM - BFD status on cr2-eqdfw is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [09:52:26] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:52:30] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:52:40] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 185.15.59.129, interfaces up: 59, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:52:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124', diff saved to https://phabricator.wikimedia.org/P52168 and previous config saved to /var/cache/conftool/dbconfig/20230831-095244-ladsgroup.json [09:52:46] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:52:48] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-web: sync [09:52:53] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-web: sync [09:53:14] PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [09:53:46] Amir1: k, should be all good [09:53:56] awesome.thanks [09:54:10] can you let me know once the reboots are over? [09:54:12] (03CR) 10Ayounsi: [C: 03+2] Prometheus: gnmi re-label fix [puppet] - 10https://gerrit.wikimedia.org/r/953976 (owner: 10Ayounsi) [09:54:19] (RdfStreamingUpdaterFlinkJobUnstable) resolved: (2) WCQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [09:55:00] (HelmReleaseBadStatus) resolved: (2) Helm release mw-api-ext/main on k8s@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [09:56:47] (03CR) 10Hnowlan: [C: 03+2] cassandra-http-gateway: use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/953666 (https://phabricator.wikimedia.org/T300033) (owner: 10Hnowlan) [09:56:52] (03CR) 10Muehlenhoff: "Looks good, two comments inline" [puppet] - 10https://gerrit.wikimedia.org/r/952889 (https://phabricator.wikimedia.org/T336497) (owner: 10Jbond) [09:57:15] !log ayounsi@cumin1001 START - Cookbook sre.network.tls for network device cr1-codfw [09:57:37] (03Merged) 10jenkins-bot: cassandra-http-gateway: use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/953666 (https://phabricator.wikimedia.org/T300033) (owner: 10Hnowlan) [09:58:39] Amir1: sure :) [09:59:11] !log 
ariel@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dumpsdata1002.eqiad.wmnet [09:59:14] (03CR) 10Arturo Borrero Gonzalez: [C: 04-1] "I think the problem I'm experiencing can be addressed with https://gerrit.wikimedia.org/r/c/operations/puppet/+/953685 in a less invasive " [puppet] - 10https://gerrit.wikimedia.org/r/953595 (https://phabricator.wikimedia.org/T345240) (owner: 10Arturo Borrero Gonzalez) [09:59:34] (CirrusSearchJobQueueLagTooHigh) firing: CirrusSearch job cirrusSearchLinksUpdate lag is too high: 6h 1m 40s - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=eqiad%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueLagTooHigh [10:00:05] mvolz: How many deployers does it take to do Services – Citoid / Zotero deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230831T1000). [10:00:05] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230831T1000) [10:00:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T343718)', diff saved to https://phabricator.wikimedia.org/P52169 and previous config saved to /var/cache/conftool/dbconfig/20230831-100026-ladsgroup.json [10:00:28] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1180.eqiad.wmnet with reason: Maintenance [10:00:37] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [10:00:41] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1180.eqiad.wmnet with reason: Maintenance [10:00:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1180 (T343718)', diff saved to https://phabricator.wikimedia.org/P52170 and previous config saved to /var/cache/conftool/dbconfig/20230831-100047-ladsgroup.json [10:00:57] !log ariel@cumin1001 START - 
Cookbook sre.hosts.reboot-single for host dumpsdata1004.eqiad.wmnet [10:01:04] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:01:06] (03CR) 10Jbond: [V: 03+1] "updated" [puppet] - 10https://gerrit.wikimedia.org/r/952889 (https://phabricator.wikimedia.org/T336497) (owner: 10Jbond) [10:01:12] (03CR) 10Ayounsi: [C: 03+1] Do not add DHCP exception for unconfigured ints on L3 switches [homer/public] - 10https://gerrit.wikimedia.org/r/953975 (https://phabricator.wikimedia.org/T345273) (owner: 10Cathal Mooney) [10:01:33] (03PS4) 10Gehel: Start Blazegraph from systemd unit, without runBlazegraph.sh [puppet] - 10https://gerrit.wikimedia.org/r/950136 (https://phabricator.wikimedia.org/T342361) [10:01:44] PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:01:46] (03CR) 10CI reject: [V: 04-1] Start Blazegraph from systemd unit, without runBlazegraph.sh [puppet] - 10https://gerrit.wikimedia.org/r/950136 (https://phabricator.wikimedia.org/T342361) (owner: 10Gehel) [10:01:49] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cr1-codfw [10:02:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [10:02:49] (03CR) 10JMeybohm: [C: 04-1] hieradata: add jaeger collector to service catalog (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/952151 (https://phabricator.wikimedia.org/T344253) (owner: 
10Filippo Giunchedi) [10:02:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T343718)', diff saved to https://phabricator.wikimedia.org/P52171 and previous config saved to /var/cache/conftool/dbconfig/20230831-100256-ladsgroup.json [10:03:10] RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:04:34] (CirrusSearchJobQueueLagTooHigh) resolved: CirrusSearch job cirrusSearchLinksUpdate lag is too high: 6h 8m 17s - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=eqiad%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueLagTooHigh [10:05:23] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM!" [homer/public] - 10https://gerrit.wikimedia.org/r/953963 (https://phabricator.wikimedia.org/T316544) (owner: 10Ayounsi) [10:06:52] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics_privatedata_users, deployment_members for Mabualruz - https://phabricator.wikimedia.org/T342535 (10Ladsgroup) What kind of analytics data you need access to? https://wikitech.wikimedia.org/wiki/Analytics/Data_access#Access_Level... 
[10:07:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [10:07:30] (03PS5) 10Gehel: Start Blazegraph from systemd unit, without runBlazegraph.sh [puppet] - 10https://gerrit.wikimedia.org/r/950136 (https://phabricator.wikimedia.org/T342361) [10:07:44] !log ariel@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dumpsdata1004.eqiad.wmnet [10:07:48] (03CR) 10Cathal Mooney: [C: 03+1] Only advertise local customers to external peers [homer/public] - 10https://gerrit.wikimedia.org/r/947993 (https://phabricator.wikimedia.org/T334530) (owner: 10Ayounsi) [10:07:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124 (T343718)', diff saved to https://phabricator.wikimedia.org/P52172 and previous config saved to /var/cache/conftool/dbconfig/20230831-100750-ladsgroup.json [10:07:53] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2129.codfw.wmnet with reason: Maintenance [10:07:53] (03CR) 10Ladsgroup: [C: 03+1] "I'll deploy it once the k8s reboots are done." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/953973 (https://phabricator.wikimedia.org/T343308) (owner: 10Ilias Sarantopoulos) [10:07:59] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [10:08:06] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2129.codfw.wmnet with reason: Maintenance [10:08:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2129 (T343718)', diff saved to https://phabricator.wikimedia.org/P52173 and previous config saved to /var/cache/conftool/dbconfig/20230831-100811-ladsgroup.json [10:08:15] !log ariel@cumin1001 START - Cookbook sre.hosts.reboot-single for host dumpsdata1005.eqiad.wmnet [10:10:59] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/geo-analytics: apply [10:11:41] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/geo-analytics: apply [10:15:12] !log ariel@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dumpsdata1005.eqiad.wmnet [10:15:15] (03PS2) 10Ladsgroup: admin: Add Mabualruz to analytics-private-data [puppet] - 10https://gerrit.wikimedia.org/r/953565 (https://phabricator.wikimedia.org/T342535) [10:16:25] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/geo-analytics: apply [10:16:29] (03CR) 10Cathal Mooney: [C: 04-1] "Manual record is fine but we should remove it from Netbox if we want to do that, see comment inline probably best to leave this handled fr" [dns] - 10https://gerrit.wikimedia.org/r/936236 (https://phabricator.wikimedia.org/T341220) (owner: 10Arturo Borrero Gonzalez) [10:16:43] !log ayounsi@cumin1001 START - Cookbook sre.network.tls for network device cr1-drmrs [10:17:19] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/geo-analytics: apply [10:17:50] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [10:18:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance 
db1180', diff saved to https://phabricator.wikimedia.org/P52174 and previous config saved to /var/cache/conftool/dbconfig/20230831-101802-ladsgroup.json [10:18:44] (03PS3) 10Cathal Mooney: Change hierdata parents for leaf switches eqiad row F [puppet] - 10https://gerrit.wikimedia.org/r/928056 (https://phabricator.wikimedia.org/T322937) [10:19:10] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/953565 (https://phabricator.wikimedia.org/T342535) (owner: 10Ladsgroup) [10:19:47] (03PS1) 10Clément Goubert: mw-api-ext: Raise number of canary replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/953979 [10:20:10] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/geo-analytics: apply [10:20:47] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/geo-analytics: apply [10:21:24] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cr1-drmrs [10:21:30] ayounsi@cumin1001: Failed to log message to wiki. Somebody should check the error logs. [10:21:30] (03CR) 10Cathal Mooney: [C: 03+2] Change hierdata parents for leaf switches eqiad row F [puppet] - 10https://gerrit.wikimedia.org/r/928056 (https://phabricator.wikimedia.org/T322937) (owner: 10Cathal Mooney) [10:21:42] !log ariel@cumin1001 START - Cookbook sre.hosts.reboot-single for host htmldumper1001.eqiad.wmnet [10:22:45] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove management record for ssw1-a1-codfw - cmooney@cumin1001" [10:22:59] 10SRE-swift-storage, 10Thumbor, 10Traffic: Cache thumbs in our caching infrastructure (e.g. 
ATS) - https://phabricator.wikimedia.org/T345334 (10MatthewVernon) [10:23:15] (03PS2) 10Cathal Mooney: Adjust network prepare-upgrade cookbook to use TCP 8080 [cookbooks] - 10https://gerrit.wikimedia.org/r/942638 (https://phabricator.wikimedia.org/T337028) [10:23:36] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/media-analytics: apply [10:23:37] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove management record for ssw1-a1-codfw - cmooney@cumin1001" [10:23:37] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:23:37] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.network.provision (exit_code=99) for device ssw1-a1-codfw.mgmt.codfw.wmnet [10:23:55] 10SRE-swift-storage, 10Thumbor, 10Traffic: Cache thumbs in our caching infrastructure (e.g. ATS) - https://phabricator.wikimedia.org/T345334 (10MatthewVernon) [I spoke to @KOfori about this, and they suggested opening a phab task tagged traffic was the best next step] [10:24:11] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/media-analytics: apply [10:25:18] (03PS3) 10Ladsgroup: admin: Add Mabualruz to analytics-private-data [puppet] - 10https://gerrit.wikimedia.org/r/953565 (https://phabricator.wikimedia.org/T342535) [10:25:20] !log elukey@cumin1001 END (PASS) - Cookbook sre.kafka.reboot-workers (exit_code=0) for Kafka main-eqiad cluster: Reboot kafka nodes [10:25:22] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] admin: Add Mabualruz to analytics-private-data [puppet] - 10https://gerrit.wikimedia.org/r/953565 (https://phabricator.wikimedia.org/T342535) (owner: 10Ladsgroup) [10:26:22] (03PS3) 10Cathal Mooney: Adjust network prepare-upgrade cookbook to use TCP 8080 [cookbooks] - 10https://gerrit.wikimedia.org/r/942638 (https://phabricator.wikimedia.org/T337028) [10:26:48] (03CR) 10Cathal Mooney: Adjust network prepare-upgrade cookbook to use TCP 
8080 (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/942638 (https://phabricator.wikimedia.org/T337028) (owner: 10Cathal Mooney) [10:27:40] !log ariel@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host htmldumper1001.eqiad.wmnet [10:28:01] (03CR) 10Slyngshede: [C: 03+2] Facter: Python version [puppet] - 10https://gerrit.wikimedia.org/r/942641 (https://phabricator.wikimedia.org/T271196) (owner: 10Slyngshede) [10:28:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2129 (T343718)', diff saved to https://phabricator.wikimedia.org/P52175 and previous config saved to /var/cache/conftool/dbconfig/20230831-102813-ladsgroup.json [10:28:19] ladsgroup@cumin1001: Failed to log message to wiki. Somebody should check the error logs. [10:28:21] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [10:30:33] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43076/console" [puppet] - 10https://gerrit.wikimedia.org/r/953674 (https://phabricator.wikimedia.org/T336485) (owner: 10Cathal Mooney) [10:31:07] (03CR) 10Jbond: [C: 03+1] "lgtm https://puppet-compiler.wmflabs.org/output/953674/43076/" [labs/private] - 10https://gerrit.wikimedia.org/r/953971 (https://phabricator.wikimedia.org/T336485) (owner: 10Cathal Mooney) [10:31:11] (03PS6) 10Cathal Mooney: Add alert for server-side NIC errors [alerts] - 10https://gerrit.wikimedia.org/r/915489 (https://phabricator.wikimedia.org/T335350) [10:31:19] (03CR) 10Jbond: [C: 03+1] "lgtm https://puppet-compiler.wmflabs.org/output/953674/43076/" [puppet] - 10https://gerrit.wikimedia.org/r/953674 (https://phabricator.wikimedia.org/T336485) (owner: 10Cathal Mooney) [10:31:35] (03CR) 10Muehlenhoff: firewall: move conntrack logic to firewall module (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/953276 (https://phabricator.wikimedia.org/T336497) 
(owner: 10Jbond) [10:33:03] (03CR) 10Cathal Mooney: [C: 03+2] Move Juniper temp ztp password from installserver to apt_repo [labs/private] - 10https://gerrit.wikimedia.org/r/953971 (https://phabricator.wikimedia.org/T336485) (owner: 10Cathal Mooney) [10:33:07] (03CR) 10Jbond: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/942638 (https://phabricator.wikimedia.org/T337028) (owner: 10Cathal Mooney) [10:33:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P52176 and previous config saved to /var/cache/conftool/dbconfig/20230831-103308-ladsgroup.json [10:33:17] (03CR) 10Cathal Mooney: [V: 03+2 C: 03+2] Move Juniper temp ztp password from installserver to apt_repo [labs/private] - 10https://gerrit.wikimedia.org/r/953971 (https://phabricator.wikimedia.org/T336485) (owner: 10Cathal Mooney) [10:33:57] (JobUnavailable) firing: Reduced availability for job nutcracker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:34:21] !log ariel@cumin1001 START - Cookbook sre.hosts.reboot-single for host dumpsdata1007.eqiad.wmnet [10:34:50] (03CR) 10Cathal Mooney: [C: 03+2] Add alert for server-side NIC errors [alerts] - 10https://gerrit.wikimedia.org/r/915489 (https://phabricator.wikimedia.org/T335350) (owner: 10Cathal Mooney) [10:35:17] (03CR) 10Cathal Mooney: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/953674 (https://phabricator.wikimedia.org/T336485) (owner: 10Cathal Mooney) [10:35:22] PROBLEM - Check correctness of the icinga configuration on alert1001 is CRITICAL: Icinga configuration contains errors https://wikitech.wikimedia.org/wiki/Icinga [10:35:38] (03CR) 10Jon Harald Søby: [C: 04-1] "The "Á" at the end of wikipedia-wordmark-tly.svg looks weird, like it's bene squished to make the entire 
letter fit the height of the "V"." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953751 (https://phabricator.wikimedia.org/T345316) (owner: 10Anzx) [10:36:05] (03Merged) 10jenkins-bot: Add alert for server-side NIC errors [alerts] - 10https://gerrit.wikimedia.org/r/915489 (https://phabricator.wikimedia.org/T335350) (owner: 10Cathal Mooney) [10:37:33] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/media-analytics: apply [10:38:01] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/media-analytics: apply [10:38:32] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:38:57] (03CR) 10Muehlenhoff: ferm: add ensure support to the ferm class (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/952889 (https://phabricator.wikimedia.org/T336497) (owner: 10Jbond) [10:39:20] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on clouddb1017.eqiad.wmnet with reason: Maintenance [10:39:34] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1017.eqiad.wmnet with reason: Maintenance [10:39:58] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:40:18] RECOVERY - MariaDB memory on clouddb1017 is OK: OK Memory 0% used https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [10:40:51] (03PS1) 10Muehlenhoff: Failover IDP to idp1002 [dns] - 10https://gerrit.wikimedia.org/r/953980 [10:41:46] !log installing cjose security updates [10:41:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:57] (03CR) 10Arturo Borrero Gonzalez: [V: 04-1] "PCC not as expected: https://puppet-compiler.wmflabs.org/output/953685/43074/" [puppet] - 
10https://gerrit.wikimedia.org/r/953685 (https://phabricator.wikimedia.org/T345240) (owner: 10Arturo Borrero Gonzalez) [10:42:15] !log ariel@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dumpsdata1007.eqiad.wmnet [10:42:38] PROBLEM - haproxy failover on dbproxy1019 is CRITICAL: CRITICAL check_failover servers up 14 down 2: https://wikitech.wikimedia.org/wiki/HAProxy [10:43:02] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/media-analytics: apply [10:43:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2129', diff saved to https://phabricator.wikimedia.org/P52177 and previous config saved to /var/cache/conftool/dbconfig/20230831-104319-ladsgroup.json [10:43:26] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/media-analytics: apply [10:44:04] RECOVERY - haproxy failover on dbproxy1019 is OK: OK check_failover servers up 16 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [10:44:08] (03CR) 10Cathal Mooney: [C: 03+2] Modify Juniper ZTP script used during initial provision [puppet] - 10https://gerrit.wikimedia.org/r/953674 (https://phabricator.wikimedia.org/T336485) (owner: 10Cathal Mooney) [10:44:44] (03PS1) 10Muehlenhoff: Add library hint for cjose [puppet] - 10https://gerrit.wikimedia.org/r/953981 [10:44:48] PROBLEM - haproxy failover on dbproxy1018 is CRITICAL: CRITICAL check_failover servers up 16 down 2: https://wikitech.wikimedia.org/wiki/HAProxy [10:45:43] (03PS2) 10Arturo Borrero Gonzalez: openstack: eqiad1: pdns: use modern recursor setting for cloudservices1006 [puppet] - 10https://gerrit.wikimedia.org/r/953685 (https://phabricator.wikimedia.org/T345240) [10:46:14] (03PS3) 10Arturo Borrero Gonzalez: openstack: eqiad1: pdns: use modern recursor setting for cloudservices1006 [puppet] - 10https://gerrit.wikimedia.org/r/953685 (https://phabricator.wikimedia.org/T345240) [10:46:27] !log hnowlan@deploy1002 helmfile [staging] START 
helmfile.d/services/device-analytics: apply [10:46:53] !log ariel@cumin1001 START - Cookbook sre.hosts.reboot-single for host dumpsdata1006.eqiad.wmnet [10:47:00] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/device-analytics: apply [10:47:09] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/953685 (https://phabricator.wikimedia.org/T345240) (owner: 10Arturo Borrero Gonzalez) [10:47:22] !log mfossati@deploy1002 Started deploy [airflow-dags/platform_eng@90f280e]: (no justification provided) [10:47:31] !log mfossati@deploy1002 Finished deploy [airflow-dags/platform_eng@90f280e]: (no justification provided) (duration: 00m 09s) [10:48:01] (03CR) 10Muehlenhoff: [C: 03+2] Add library hint for cjose [puppet] - 10https://gerrit.wikimedia.org/r/953981 (owner: 10Muehlenhoff) [10:48:13] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/device-analytics: apply [10:48:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T343718)', diff saved to https://phabricator.wikimedia.org/P52178 and previous config saved to /var/cache/conftool/dbconfig/20230831-104815-ladsgroup.json [10:48:17] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1187.eqiad.wmnet with reason: Maintenance [10:48:22] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [10:48:30] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1187.eqiad.wmnet with reason: Maintenance [10:48:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1187 (T343718)', diff saved to https://phabricator.wikimedia.org/P52179 and previous config saved to /var/cache/conftool/dbconfig/20230831-104836-ladsgroup.json [10:48:45] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: eqiad1: pdns: use modern recursor setting for cloudservices1006 [puppet] - 
10https://gerrit.wikimedia.org/r/953685 (https://phabricator.wikimedia.org/T345240) (owner: 10Arturo Borrero Gonzalez) [10:48:54] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/device-analytics: apply [10:49:03] (03PS15) 10Jbond: ferm: add ensure support to the ferm class [puppet] - 10https://gerrit.wikimedia.org/r/952889 (https://phabricator.wikimedia.org/T336497) [10:49:51] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/device-analytics: apply [10:50:00] !log installing flask security updates on buster [10:50:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:20] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/device-analytics: apply [10:50:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T343718)', diff saved to https://phabricator.wikimedia.org/P52180 and previous config saved to /var/cache/conftool/dbconfig/20230831-105044-ladsgroup.json [10:50:45] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/device-analytics: apply [10:50:51] !log cmooney@cumin1001 START - Cookbook sre.network.provision for device ssw1-a1-codfw.mgmt.codfw.wmnet [10:50:52] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [10:51:19] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/device-analytics: apply [10:51:46] PROBLEM - BGP status on lsw1-f3-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:52:07] (03PS16) 10Jbond: ferm: add ensure support to the ferm class [puppet] - 10https://gerrit.wikimedia.org/r/952889 (https://phabricator.wikimedia.org/T336497) [10:52:12] (03CR) 10Jbond: "fixed" [puppet] - 10https://gerrit.wikimedia.org/r/952889 (https://phabricator.wikimedia.org/T336497) (owner: 10Jbond) [10:52:55] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera 
generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for ssw1-a1-codfw - cmooney@cumin1001" [10:53:10] RECOVERY - BGP status on lsw1-f3-eqiad.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:53:44] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for ssw1-a1-codfw - cmooney@cumin1001" [10:53:44] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:53:47] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 3 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43077/console" [puppet] - 10https://gerrit.wikimedia.org/r/952889 (https://phabricator.wikimedia.org/T336497) (owner: 10Jbond) [10:54:20] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-cluster (exit_code=0) [10:54:39] !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-cluster [10:54:43] !log ariel@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dumpsdata1006.eqiad.wmnet [10:55:40] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:56:23] (03CR) 10Cathal Mooney: [C: 03+2] Do not add DHCP exception for unconfigured ints on L3 switches [homer/public] - 10https://gerrit.wikimedia.org/r/953975 (https://phabricator.wikimedia.org/T345273) (owner: 10Cathal Mooney) [10:56:54] (03Merged) 10jenkins-bot: Do not add DHCP exception for unconfigured ints on L3 switches [homer/public] - 10https://gerrit.wikimedia.org/r/953975 (https://phabricator.wikimedia.org/T345273) (owner: 10Cathal Mooney) [10:57:04] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:57:23] (03PS1) 10Hnowlan: device-analytics: use global AQS configuration files [deployment-charts] - 10https://gerrit.wikimedia.org/r/953982 (https://phabricator.wikimedia.org/T320967) [10:58:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2129', diff saved to https://phabricator.wikimedia.org/P52181 and previous config saved to /var/cache/conftool/dbconfig/20230831-105826-ladsgroup.json [11:01:11] (03CR) 10FNegri: [C: 03+1] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/952848 (owner: 10Muehlenhoff) [11:01:41] 10SRE, 10SRE-swift-storage, 10Thumbor, 10Traffic: Cache thumbs in our caching infrastructure (e.g. ATS) - https://phabricator.wikimedia.org/T345334 (10Vgutierrez) Happy to provide assistance and guidance if needed but caching is technically controlled by the backend services and not by the CDN. the CDN imp... [11:01:57] (SystemdUnitFailed) firing: elasticsearch-disable-readahead.service Failed on elastic2038:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:05:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P52182 and previous config saved to /var/cache/conftool/dbconfig/20230831-110551-ladsgroup.json [11:06:59] !log jayme@cumin1001 START - Cookbook sre.hosts.remove-downtime for kubernetes1025.eqiad.wmnet [11:06:59] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for kubernetes1025.eqiad.wmnet [11:07:43] (03PS1) 10Jbond: run-puppet-agent: drop deprecated ignorecache switch [puppet] - 10https://gerrit.wikimedia.org/r/953985 (https://phabricator.wikimedia.org/T341496) [11:08:00] !log cmooney@cumin1001 END (PASS) - Cookbook sre.network.provision (exit_code=0) for device 
ssw1-a1-codfw.mgmt.codfw.wmnet [11:08:34] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:09:52] (03CR) 10CI reject: [V: 04-1] run-puppet-agent: drop deprecated ignorecache switch [puppet] - 10https://gerrit.wikimedia.org/r/953985 (https://phabricator.wikimedia.org/T341496) (owner: 10Jbond) [11:13:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2129 (T343718)', diff saved to https://phabricator.wikimedia.org/P52183 and previous config saved to /var/cache/conftool/dbconfig/20230831-111332-ladsgroup.json [11:13:34] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2151.codfw.wmnet with reason: Maintenance [11:13:38] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [11:13:47] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2151.codfw.wmnet with reason: Maintenance [11:13:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2151 (T343718)', diff saved to https://phabricator.wikimedia.org/P52184 and previous config saved to /var/cache/conftool/dbconfig/20230831-111353-ladsgroup.json [11:15:06] (03PS2) 10Slyngshede: Lowercase email addresses. [software/bitu] - 10https://gerrit.wikimedia.org/r/953972 [11:15:46] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:17:10] (03PS2) 10Jbond: run-puppet-agent: drop deprecated ignorecache switch [puppet] - 10https://gerrit.wikimedia.org/r/953985 (https://phabricator.wikimedia.org/T341496) [11:17:33] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, thanks! PCC is also fine." 
[puppet] - 10https://gerrit.wikimedia.org/r/952889 (https://phabricator.wikimedia.org/T336497) (owner: 10Jbond) [11:19:20] (03PS1) 10Jbond: puppet: drop deprecated ignorecache switch [software/spicerack] - 10https://gerrit.wikimedia.org/r/953990 (https://phabricator.wikimedia.org/T341496) [11:20:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P52185 and previous config saved to /var/cache/conftool/dbconfig/20230831-112057-ladsgroup.json [11:24:54] (03CR) 10Hnowlan: [C: 03+2] thumbor: Update dependencies to be ready for cert manager [deployment-charts] - 10https://gerrit.wikimedia.org/r/953667 (https://phabricator.wikimedia.org/T300033) (owner: 10Hnowlan) [11:25:43] (03Merged) 10jenkins-bot: thumbor: Update dependencies to be ready for cert manager [deployment-charts] - 10https://gerrit.wikimedia.org/r/953667 (https://phabricator.wikimedia.org/T300033) (owner: 10Hnowlan) [11:27:01] 10SRE, 10Infrastructure-Foundations, 10netops: Juniper ZTP fails on certain devices due to DHCP binding on management router - https://phabricator.wikimedia.org/T345273 (10cmooney) 05Open→03Resolved [11:27:06] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Setup zero touch provisioning (ZTP) for network devices - https://phabricator.wikimedia.org/T336485 (10cmooney) [11:30:32] 10SRE-tools, 10Spicerack: spicerack: tox fails to install PyYAML using python 3.11 on bookworm - https://phabricator.wikimedia.org/T345337 (10jbond) p:05Triage→03Medium [11:31:02] !log jayme@cumin1001 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on A:wikikube-worker-eqiad [11:31:35] (03CR) 10Majavah: [V: 03+1] "PCC: https://puppet-compiler.wmflabs.org/output/953577/43058/" [puppet] - 10https://gerrit.wikimedia.org/r/953577 (owner: 10Majavah) [11:31:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T343718)', diff saved to 
https://phabricator.wikimedia.org/P52186 and previous config saved to /var/cache/conftool/dbconfig/20230831-113136-ladsgroup.json [11:31:42] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [11:32:25] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubemaster1001.eqiad.wmnet [11:32:25] (03CR) 10Muehlenhoff: [C: 03+2] Openstack: remove support for Stretch [puppet] - 10https://gerrit.wikimedia.org/r/952848 (owner: 10Muehlenhoff) [11:33:00] (03CR) 10Muehlenhoff: [C: 03+2] Failover IDP to idp1002 [dns] - 10https://gerrit.wikimedia.org/r/953980 (owner: 10Muehlenhoff) [11:33:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1132', diff saved to https://phabricator.wikimedia.org/P52187 and previous config saved to /var/cache/conftool/dbconfig/20230831-113324-root.json [11:33:27] Amir1: eqiad k8s reboots done, give a few minutes to jayme so he can reboot the masters and you're good to keep deploying [11:34:29] (03PS1) 10Kosta Harlan: Add ReportIncident extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953998 (https://phabricator.wikimedia.org/T339275) [11:34:31] (03PS1) 10Kosta Harlan: ReportIncident: Default deployment to false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953999 (https://phabricator.wikimedia.org/T339275) [11:35:16] (03CR) 10Marostegui: [C: 03+2] mariadb: Move db1119 to s1 [puppet] - 10https://gerrit.wikimedia.org/r/953555 (https://phabricator.wikimedia.org/T339835) (owner: 10Marostegui) [11:35:25] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: spicerack: tox fails to install PyYAML using python 3.11 on bookworm - https://phabricator.wikimedia.org/T345337 (10jbond) [11:35:39] moritzm: ok to merge your change? 
[11:36:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T343718)', diff saved to https://phabricator.wikimedia.org/P52189 and previous config saved to /var/cache/conftool/dbconfig/20230831-113603-ladsgroup.json [11:36:05] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1201.eqiad.wmnet with reason: Maintenance [11:36:08] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1201.eqiad.wmnet with reason: Maintenance [11:36:08] marostegui: yes, please [11:36:14] moritzm: done [11:36:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1201 (T343718)', diff saved to https://phabricator.wikimedia.org/P52190 and previous config saved to /var/cache/conftool/dbconfig/20230831-113613-ladsgroup.json [11:36:34] cheers [11:37:14] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:37:45] (03CR) 10JMeybohm: [C: 03+1] mesh: new configuration version [deployment-charts] - 10https://gerrit.wikimedia.org/r/953575 (https://phabricator.wikimedia.org/T320563) (owner: 10Filippo Giunchedi) [11:38:06] (03CR) 10JMeybohm: [C: 03+1] mw-api-ext: Raise number of canary replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/953979 (owner: 10Clément Goubert) [11:38:23] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: spicerack: tox fails to install PyYAML using python 3.11 on bookworm - https://phabricator.wikimedia.org/T345337 (10jbond) Also worth noting that version >= 6 are not currently working with spicerack (T328775) [11:39:02] (03CR) 10Clément Goubert: [C: 03+2] mw-api-ext: Raise number of canary replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/953979 (owner: 10Clément Goubert) [11:39:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after 
maintenance db1201 (T343718)', diff saved to https://phabricator.wikimedia.org/P52191 and previous config saved to /var/cache/conftool/dbconfig/20230831-113922-ladsgroup.json [11:39:32] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [11:39:34] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:39:47] (03Merged) 10jenkins-bot: mw-api-ext: Raise number of canary replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/953979 (owner: 10Clément Goubert) [11:39:53] (03CR) 10JMeybohm: "@jbond can you maybe take a look please?" [puppet] - 10https://gerrit.wikimedia.org/r/951124 (https://phabricator.wikimedia.org/T341669) (owner: 10JMeybohm) [11:40:32] (03CR) 10Kosta Harlan: [C: 04-2] "Wait until code is present on all branches running in production." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953998 (https://phabricator.wikimedia.org/T339275) (owner: 10Kosta Harlan) [11:40:33] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [11:40:47] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [11:43:29] claime: thanks! 
[11:44:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:44:46] !log jayme@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host kubemaster1001.eqiad.wmnet [11:45:20] PROBLEM - Check systemd state on kubemaster1001 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:45:40] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:45:44] (03PS1) 10Clément Goubert: mw-api-ext, mw-web: Raise total replicas to 13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/954000 (https://phabricator.wikimedia.org/T341780) [11:46:34] !log cmooney@cumin1001 START - Cookbook sre.network.provision for device ssw1-a8-codfw.mgmt.codfw.wmnet [11:46:36] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [11:46:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2151', diff saved to https://phabricator.wikimedia.org/P52192 and previous config saved to /var/cache/conftool/dbconfig/20230831-114642-ladsgroup.json [11:46:52] RECOVERY - Check systemd state on kubemaster1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:46:56] !log marostegui@cumin1001 START - Cookbook sre.mysql.clone of db1132.eqiad.wmnet onto db1119.eqiad.wmnet [11:47:11] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubemaster1002.eqiad.wmnet [11:48:04] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:49:13] (03PS1) 10Majavah: Move WMCS haproxy scrapes to WMCS prometheus instance [puppet] - 
10https://gerrit.wikimedia.org/r/954001 [11:49:46] (03CR) 10Jbond: [V: 03+1 C: 03+2] ferm: add ensure support to the ferm class [puppet] - 10https://gerrit.wikimedia.org/r/952889 (https://phabricator.wikimedia.org/T336497) (owner: 10Jbond) [11:50:03] (03CR) 10Jbond: [V: 03+1 C: 03+2] ferm: add ensure support to the ferm class (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/952889 (https://phabricator.wikimedia.org/T336497) (owner: 10Jbond) [11:50:08] (03PS2) 10Majavah: Move WMCS haproxy scrapes to WMCS prometheus instance [puppet] - 10https://gerrit.wikimedia.org/r/954001 (https://phabricator.wikimedia.org/T345294) [11:51:15] (03PS1) 10Clément Goubert: mw-on-k8s: Raise traffic to 4% [puppet] - 10https://gerrit.wikimedia.org/r/954002 (https://phabricator.wikimedia.org/T341780) [11:51:35] (03PS2) 10Sohom Datta: Allow loading Edit-in-Sequence as a beta feature on Wikisources [mediawiki-config] - 10https://gerrit.wikimedia.org/r/952928 (https://phabricator.wikimedia.org/T308098) [11:52:46] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:53:31] (03PS1) 10JMeybohm: service::catalog: Move k8s-ingress-aux to lvs_setuo [puppet] - 10https://gerrit.wikimedia.org/r/954003 (https://phabricator.wikimedia.org/T325178) [11:54:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1201', diff saved to https://phabricator.wikimedia.org/P52193 and previous config saved to /var/cache/conftool/dbconfig/20230831-115429-ladsgroup.json [11:55:21] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43079/console" [puppet] - 10https://gerrit.wikimedia.org/r/954001 (https://phabricator.wikimedia.org/T345294) (owner: 10Majavah) [11:59:02] (03PS2) 10JMeybohm: service::catalog: Move k8s-ingress-aux to lvs_setup 
[puppet] - 10https://gerrit.wikimedia.org/r/954003 (https://phabricator.wikimedia.org/T325178) [11:59:26] !log jayme@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host kubemaster1002.eqiad.wmnet [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230831T1200) [12:00:20] (03PS3) 10Slyngshede: Lowercase email addresses. [software/bitu] - 10https://gerrit.wikimedia.org/r/953972 [12:00:23] Amir1: gogogo [12:00:56] :D [12:01:01] isaranto: shall we deploy? [12:01:16] (enabling LW in itwiki and so on) [12:01:22] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:01:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2151', diff saved to https://phabricator.wikimedia.org/P52194 and previous config saved to /var/cache/conftool/dbconfig/20230831-120148-ladsgroup.json [12:02:18] !log About to deploy analytics refinery (weekly train) [12:02:20] Amir1: yes! 
I am here to test [12:02:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:32] jouncebot: nowandnext [12:02:33] For the next 0 hour(s) and 57 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230831T1200) [12:02:33] In 0 hour(s) and 57 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230831T1300) [12:02:38] cool [12:02:50] (03CR) 10Ladsgroup: [C: 03+2] ores-extension: enable lift wing for fiwiki and itwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953973 (https://phabricator.wikimedia.org/T343308) (owner: 10Ilias Sarantopoulos) [12:02:55] (03CR) 10JMeybohm: [C: 03+1] mw-api-ext, mw-web: Raise total replicas to 13 [deployment-charts] - 10https://gerrit.wikimedia.org/r/954000 (https://phabricator.wikimedia.org/T341780) (owner: 10Clément Goubert) [12:03:11] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953973 (https://phabricator.wikimedia.org/T343308) (owner: 10Ilias Sarantopoulos) [12:03:21] !log aqu@deploy1002 Started deploy [analytics/refinery@06203c0]: Regular analytics weekly train [analytics/refinery@06203c0] [12:03:29] (03Merged) 10jenkins-bot: ores-extension: enable lift wing for fiwiki and itwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953973 (https://phabricator.wikimedia.org/T343308) (owner: 10Ilias Sarantopoulos) [12:03:44] (03CR) 10JMeybohm: [C: 03+1] mw-on-k8s: Raise traffic to 4% [puppet] - 10https://gerrit.wikimedia.org/r/954002 (https://phabricator.wikimedia.org/T341780) (owner: 10Clément Goubert) [12:03:56] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:953973|ores-extension: enable lift wing for fiwiki and itwiki (T343308)]] [12:04:01] T343308: fiwiki RC filters classify all edits as 'very likely bad faith' - https://phabricator.wikimedia.org/T343308 [12:04:04] 
RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:04:40] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43080/console" [puppet] - 10https://gerrit.wikimedia.org/r/954003 (https://phabricator.wikimedia.org/T325178) (owner: 10JMeybohm) [12:05:08] (03PS1) 10Sergio Gimeno: GrowthExperiments: enable AddLink backend for swwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/954004 (https://phabricator.wikimedia.org/T308139) [12:05:11] (03CR) 10Vgutierrez: [V: 03+1 C: 03+1] service::catalog: Move k8s-ingress-aux to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/954003 (https://phabricator.wikimedia.org/T325178) (owner: 10JMeybohm) [12:05:34] !log ladsgroup@deploy1002 isaranto and ladsgroup: Backport for [[gerrit:953973|ores-extension: enable lift wing for fiwiki and itwiki (T343308)]] synced to the testservers mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [12:05:54] isaranto: it's live in mwdebug [12:06:14] PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:06:52] jouncebot: nowandnext [12:06:52] For the next 0 hour(s) and 53 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230831T1200) [12:06:52] In 0 hour(s) and 53 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230831T1300) [12:07:42] RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state 
[12:07:46] vgutierrez: I'm deploying :D [12:09:19] !log cmooney@cumin1001 START - Cookbook sre.network.provision for device lsw1-a1-codfw.mgmt.codfw.wmnet [12:09:30] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.network.provision (exit_code=99) for device lsw1-a1-codfw.mgmt.codfw.wmnet [12:09:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1201', diff saved to https://phabricator.wikimedia.org/P52195 and previous config saved to /var/cache/conftool/dbconfig/20230831-120935-ladsgroup.json [12:09:41] !log cmooney@cumin1001 START - Cookbook sre.network.provision for device lsw1-a2-codfw.mgmt.codfw.wmnet [12:09:49] !log cmooney@cumin1001 START - Cookbook sre.network.provision for device lsw1-a3-codfw.mgmt.codfw.wmnet [12:09:58] !log cmooney@cumin1001 START - Cookbook sre.network.provision for device lsw1-a4-codfw.mgmt.codfw.wmnet [12:10:01] (DatasourceError) firing: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceError [12:10:01] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.network.provision (exit_code=99) for device lsw1-a2-codfw.mgmt.codfw.wmnet [12:10:07] !log cmooney@cumin1001 START - Cookbook sre.network.provision for device lsw1-a5-codfw.mgmt.codfw.wmnet [12:10:09] isaranto: are you testing? 
[12:10:09] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.network.provision (exit_code=99) for device lsw1-a3-codfw.mgmt.codfw.wmnet [12:10:15] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.network.provision (exit_code=99) for device lsw1-a4-codfw.mgmt.codfw.wmnet [12:10:27] !log cmooney@cumin1001 START - Cookbook sre.network.provision for device lsw1-a7-codfw.mgmt.codfw.wmnet [12:10:29] !log cmooney@cumin1001 START - Cookbook sre.network.provision for device lsw1-a6-codfw.mgmt.codfw.wmnet [12:10:37] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.network.provision (exit_code=99) for device lsw1-a7-codfw.mgmt.codfw.wmnet [12:10:37] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.network.provision (exit_code=99) for device lsw1-a6-codfw.mgmt.codfw.wmnet [12:10:39] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.network.provision (exit_code=99) for device lsw1-a5-codfw.mgmt.codfw.wmnet [12:10:41] (03PS4) 10Slyngshede: Lowercase email addresses. [software/bitu] - 10https://gerrit.wikimedia.org/r/953972 [12:10:48] !log cmooney@cumin1001 START - Cookbook sre.network.provision for device lsw1-a8-codfw.mgmt.codfw.wmnet [12:10:51] !log cmooney@cumin1001 START - Cookbook sre.network.provision for device lsw1-b2-codfw.mgmt.codfw.wmnet [12:10:53] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [12:10:59] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.network.provision (exit_code=99) for device lsw1-b2-codfw.mgmt.codfw.wmnet [12:12:56] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:12:57] (03CR) 10Arturo Borrero Gonzalez: "maybe include the required firewalling changes in the same patch?" [puppet] - 10https://gerrit.wikimedia.org/r/954001 (https://phabricator.wikimedia.org/T345294) (owner: 10Majavah) [12:13:45] Amir1: yes. 
I am getting a 503 when running a job for itwiki -> `Service failed to respond properly: Failed to make LiftWing request to [http://localhost:6031/v1/models/itwiki-damaging:predict], There was a problem during the HTTP request: 503 Service Unavailable` [12:14:19] (03PS3) 10Majavah: Move WMCS haproxy scrapes to WMCS prometheus instance [puppet] - 10https://gerrit.wikimedia.org/r/954001 (https://phabricator.wikimedia.org/T345294) [12:14:22] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:14:52] (DatasourceError) firing: (2) - https://alerts.wikimedia.org/?q=alertname%3DDatasourceError [12:15:11] that seems to be a problem from LW [12:15:23] (03PS2) 10Sergio Gimeno: GrowthExperiments: enable AddLink backend for swwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/954004 (https://phabricator.wikimedia.org/T308139) [12:15:32] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_drmrs01_sync.service,netbox_ganeti_drmrs02_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:15:33] !log cmooney@cumin1001 START - Cookbook sre.network.provision for device ssw1-a1-codfw.mgmt.codfw.wmnet [12:15:35] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [12:15:37] !log aqu@deploy1002 Finished deploy [analytics/refinery@06203c0]: Regular analytics weekly train [analytics/refinery@06203c0] (duration: 12m 15s) [12:16:06] (03PS1) 10KartikMistry: Update MinT to 2023-08-31-061147-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/954005 (https://phabricator.wikimedia.org/T336683) [12:16:15] !log cmooney@cumin1001 START - Cookbook sre.network.provision for device lsw1-a1-codfw.mgmt.codfw.wmnet [12:16:19] Amir1: checking from another host [12:16:21] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.network.provision (exit_code=99) for device 
lsw1-a1-codfw.mgmt.codfw.wmnet [12:16:30] !log cmooney@cumin1001 START - Cookbook sre.network.provision for device lsw1-a4-codfw.mgmt.codfw.wmnet [12:16:32] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [12:16:34] !log cmooney@cumin1001 START - Cookbook sre.network.provision for device lsw1-a5-codfw.mgmt.codfw.wmnet [12:16:36] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [12:16:39] !log cmooney@cumin1001 START - Cookbook sre.network.provision for device lsw1-a6-codfw.mgmt.codfw.wmnet [12:16:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T343718)', diff saved to https://phabricator.wikimedia.org/P52196 and previous config saved to /var/cache/conftool/dbconfig/20230831-121654-ladsgroup.json [12:16:57] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2158.codfw.wmnet with reason: Maintenance [12:17:00] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [12:17:11] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2158.codfw.wmnet with reason: Maintenance [12:17:11] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.network.provision (exit_code=99) for device lsw1-a6-codfw.mgmt.codfw.wmnet [12:17:12] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance [12:17:13] !log cmooney@cumin1001 START - Cookbook sre.network.provision for device lsw1-b5-codfw.mgmt.codfw.wmnet [12:17:14] !log cmooney@cumin1001 START - Cookbook sre.network.provision for device lsw1-b6-codfw.mgmt.codfw.wmnet [12:17:14] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [12:17:14] !log cmooney@cumin1001 START - Cookbook sre.network.provision for device lsw1-b4-codfw.mgmt.codfw.wmnet [12:17:14] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [12:17:15] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 2 days, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance [12:17:15] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [12:17:17] !log cmooney@cumin1001 START - Cookbook sre.network.provision for device lsw1-b3-codfw.mgmt.codfw.wmnet [12:17:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2158 (T343718)', diff saved to https://phabricator.wikimedia.org/P52197 and previous config saved to /var/cache/conftool/dbconfig/20230831-121721-ladsgroup.json [12:17:31] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] Move WMCS haproxy scrapes to WMCS prometheus instance (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/954001 (https://phabricator.wikimedia.org/T345294) (owner: 10Majavah) [12:17:43] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [12:17:46] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.network.provision (exit_code=99) for device lsw1-b3-codfw.mgmt.codfw.wmnet [12:18:12] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.network.provision (exit_code=93) for device lsw1-b6-codfw.mgmt.codfw.wmnet [12:18:44] PROBLEM - Check unit status of netbox_ganeti_drmrs02_sync on netbox1002 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_drmrs02_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:19:11] (03PS3) 10Sergio Gimeno: GrowthExperiments: enable AddLink backend for swwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/954004 (https://phabricator.wikimedia.org/T308139) [12:19:20] PROBLEM - Check unit status of netbox_ganeti_drmrs01_sync on netbox1002 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_drmrs01_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:19:26] I think I may have borked netbox running all those network.provision cookbooks in parallel [12:19:46] !log cmooney@cumin1001 END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) [12:19:53] (03CR) 10JMeybohm: [C: 03+2] service::catalog: Move k8s-ingress-aux to lvs_setup 
[puppet] - 10https://gerrit.wikimedia.org/r/954003 (https://phabricator.wikimedia.org/T325178) (owner: 10JMeybohm) [12:20:01] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [12:20:08] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [software/bitu] - 10https://gerrit.wikimedia.org/r/953972 (owner: 10Slyngshede) [12:20:15] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.network.provision (exit_code=93) for device ssw1-a1-codfw.mgmt.codfw.wmnet [12:20:23] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [12:20:48] It triggered 16 x sre.dns.netbox cookbook executions in parallel after which netbox started to struggle [12:21:10] I've aborted/they've timed out, I'll do it serially instead [12:21:17] sorry for any problems [12:21:40] !log cmooney@cumin1001 START - Cookbook sre.network.provision for device ssw1-a1-codfw.mgmt.codfw.wmnet [12:21:41] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [12:22:54] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:23:10] (03PS1) 10Jbond: confd: -prefix from confd cli to confd::file instances [puppet] - 10https://gerrit.wikimedia.org/r/954007 (https://phabricator.wikimedia.org/T341669) [12:23:12] (03CR) 10Ayounsi: [C: 03+2] Only advertise local customers to external peers [homer/public] - 10https://gerrit.wikimedia.org/r/947993 (https://phabricator.wikimedia.org/T334530) (owner: 10Ayounsi) [12:23:20] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [12:23:27] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.network.provision (exit_code=99) for device ssw1-a1-codfw.mgmt.codfw.wmnet [12:23:35] Amir1: it works now, I don't know why. 
[12:23:44] (03Merged) 10jenkins-bot: Only advertise local customers to external peers [homer/public] - 10https://gerrit.wikimedia.org/r/947993 (https://phabricator.wikimedia.org/T334530) (owner: 10Ayounsi) [12:23:47] !log ladsgroup@deploy1002 isaranto and ladsgroup: Continuing with sync [12:23:57] I'll push it forward, we'll see [12:24:07] There isn't an issue with LW. Perhaps it had to do with the envoy proxy [12:24:13] (03PS3) 10Sergio Gimeno: GrowthExperiments: enable add a link in 12th round of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/948144 (https://phabricator.wikimedia.org/T308137) [12:24:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1201 (T343718)', diff saved to https://phabricator.wikimedia.org/P52198 and previous config saved to /var/cache/conftool/dbconfig/20230831-122441-ladsgroup.json [12:24:43] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1213.eqiad.wmnet with reason: Maintenance [12:24:51] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [12:24:53] (DatasourceError) resolved: (2) - https://alerts.wikimedia.org/?q=alertname%3DDatasourceError [12:24:56] (03CR) 10CI reject: [V: 04-1] confd: -prefix from confd cli to confd::file instances [puppet] - 10https://gerrit.wikimedia.org/r/954007 (https://phabricator.wikimedia.org/T341669) (owner: 10Jbond) [12:24:56] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1213.eqiad.wmnet with reason: Maintenance [12:24:59] (03CR) 10Slyngshede: [V: 03+2 C: 03+2] Lowercase email addresses. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/953972 (owner: 10Slyngshede) [12:24:59] !log cmooney@cumin1001 START - Cookbook sre.network.provision for device ssw1-a1-codfw.mgmt.codfw.wmnet [12:25:01] !log restarting pybal on lvs1020 - T325178 [12:25:01] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [12:25:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1213:3316 (T343718)', diff saved to https://phabricator.wikimedia.org/P52199 and previous config saved to /var/cache/conftool/dbconfig/20230831-122502-ladsgroup.json [12:25:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:31] T325178: Add ingress to aux-k8s - https://phabricator.wikimedia.org/T325178 [12:25:36] (03PS4) 10Sergio Gimeno: GrowthExperiments: enable add a link in 12th round of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/948144 (https://phabricator.wikimedia.org/T308137) [12:26:17] !log cmooney@cumin1001 END (PASS) - Cookbook sre.network.provision (exit_code=0) for device ssw1-a8-codfw.mgmt.codfw.wmnet [12:26:22] PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:26:44] (03PS1) 10Stevemunene: idp: add datahub as oidc service [puppet] - 10https://gerrit.wikimedia.org/r/954009 (https://phabricator.wikimedia.org/T305874) [12:27:01] I'm not seeing anything so far in the log [12:27:04] *logs [12:27:14] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for ssw1-a1-codfw - cmooney@cumin1001" [12:27:54] !log restarting pybal on lvs1019 - T325178 [12:27:59] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for ssw1-a1-codfw - cmooney@cumin1001" 
[12:27:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:59] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:28:26] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:28:34] (KubernetesAPILatency) firing: High Kubernetes API latency (PUT replicasets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:28:48] (03PS2) 10Sergio Gimeno: GrowthExperiments: enable AddLink frontend 13th round of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951897 (https://phabricator.wikimedia.org/T308138) [12:29:10] RECOVERY - Check unit status of netbox_ganeti_drmrs02_sync on netbox1002 is OK: OK: Status of the systemd unit netbox_ganeti_drmrs02_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:29:16] RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:29:44] RECOVERY - Check unit status of netbox_ganeti_drmrs01_sync on netbox1002 is OK: OK: Status of the systemd unit netbox_ganeti_drmrs01_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:29:47] (03CR) 10Sergio Gimeno: "Scheduled September 6th" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/948144 (https://phabricator.wikimedia.org/T308137) (owner: 10Sergio Gimeno) [12:29:53] (03CR) 10Sergio Gimeno: "Scheduled September 6th" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951897 (https://phabricator.wikimedia.org/T308138) (owner: 10Sergio Gimeno) [12:30:10] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:31:01] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:953973|ores-extension: enable lift wing for fiwiki and itwiki (T343308)]] (duration: 27m 05s) [12:31:06] (03PS1) 10Muehlenhoff: Add testreduce1002 [puppet] - 10https://gerrit.wikimedia.org/r/954010 (https://phabricator.wikimedia.org/T345220) [12:31:09] T343308: fiwiki RC filters classify all edits as 'very likely bad faith' - https://phabricator.wikimedia.org/T343308 [12:31:23] (03CR) 10Majavah: [C: 03+1] replica_cnf_api: add envvars backend [puppet] - 10https://gerrit.wikimedia.org/r/936232 (https://phabricator.wikimedia.org/T265691) (owner: 10David Caro) [12:31:41] (03PS4) 10Sergio Gimeno: GrowthExperiments: enable AddLink backend for swwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/954004 (https://phabricator.wikimedia.org/T308138) [12:31:43] (03PS5) 10Sergio Gimeno: GrowthExperiments: enable add a link in 12th round of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/948144 (https://phabricator.wikimedia.org/T308137) [12:31:45] (03PS3) 10Sergio Gimeno: GrowthExperiments: enable AddLink frontend 13th round of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951897 (https://phabricator.wikimedia.org/T308138) [12:32:01] !log aqu@deploy1002 Started deploy [analytics/refinery@06203c0] (thin): Regular analytics weekly train THIN [analytics/refinery@06203c0] [12:32:06] !log aqu@deploy1002 Finished deploy [analytics/refinery@06203c0] (thin): Regular analytics weekly train THIN [analytics/refinery@06203c0] (duration: 00m 04s) [12:32:12] !log aqu@deploy1002 Started deploy [analytics/refinery@06203c0] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@06203c0] [12:32:13] (03CR) 10Peter Fischer: "Thanks, the config parameters look a lot cleaner now! I haven't understood how and where they are actually passed to the application." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/951960 (owner: 10Ebernhardson) [12:32:17] 10SRE, 10Infrastructure-Foundations, 10Stewards-and-global-tools, 10vm-requests: 1 VMs requested for stewards - https://phabricator.wikimedia.org/T344164 (10Urbanecm) >>! In T344164#9130881, @MoritzMuehlenhoff wrote: > From a high level view that seems perfectly fine. We initiate non-wiki offboardings from... [12:33:34] (KubernetesAPILatency) resolved: (3) High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:33:35] (03PS1) 10JMeybohm: service::catalog: Move k8s-ingress-aux to production [puppet] - 10https://gerrit.wikimedia.org/r/954011 (https://phabricator.wikimedia.org/T325178) [12:34:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T343718)', diff saved to https://phabricator.wikimedia.org/P52200 and previous config saved to /var/cache/conftool/dbconfig/20230831-123428-ladsgroup.json [12:34:34] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [12:35:20] !log aqu@deploy1002 Finished deploy [analytics/refinery@06203c0] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@06203c0] (duration: 03m 07s) [12:35:30] (03PS2) 10Jbond: confd: -prefix from confd cli to confd::file instances [puppet] - 10https://gerrit.wikimedia.org/r/954007 (https://phabricator.wikimedia.org/T341669) [12:35:55] !log cmooney@cumin1001 START - Cookbook sre.network.tls for network device ssw1-a8-codfw [12:36:05] !log cmooney@cumin1001 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device ssw1-a8-codfw [12:36:21] (03CR) 10Gehel: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/950136 (https://phabricator.wikimedia.org/T342361) (owner: 10Gehel) [12:37:04] (03CR) 10JMeybohm: [C: 03+2] service::catalog: Move k8s-ingress-aux to production [puppet] - 
10https://gerrit.wikimedia.org/r/954011 (https://phabricator.wikimedia.org/T325178) (owner: 10JMeybohm) [12:37:14] (03CR) 10CI reject: [V: 04-1] confd: -prefix from confd cli to confd::file instances [puppet] - 10https://gerrit.wikimedia.org/r/954007 (https://phabricator.wikimedia.org/T341669) (owner: 10Jbond) [12:38:14] (DatasourceError) firing: (2) - https://alerts.wikimedia.org/?q=alertname%3DDatasourceError [12:38:19] 10SRE, 10SRE-Access-Requests: Requesting access to ops group for abran - https://phabricator.wikimedia.org/T345343 (10ABran-WMF) [12:38:57] (03PS1) 10Jbond: Revert "ferm: add ensure support to the ferm class" [puppet] - 10https://gerrit.wikimedia.org/r/953653 [12:39:16] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:39:20] (03CR) 10Muehlenhoff: [C: 03+1] Revert "ferm: add ensure support to the ferm class" [puppet] - 10https://gerrit.wikimedia.org/r/953653 (owner: 10Jbond) [12:39:30] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2044.codfw.wmnet [12:39:41] * Lucas_WMDE deploying now [12:39:46] (security fix) [12:39:48] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1044.eqiad.wmnet [12:40:58] (03CR) 10Jbond: [C: 03+2] Revert "ferm: add ensure support to the ferm class" [puppet] - 10https://gerrit.wikimedia.org/r/953653 (owner: 10Jbond) [12:41:11] (03CR) 10CI reject: [V: 04-1] Revert "ferm: add ensure support to the ferm class" [puppet] - 10https://gerrit.wikimedia.org/r/953653 (owner: 10Jbond) [12:41:38] (03PS2) 10Jbond: Revert "ferm: add ensure support to the ferm class" [puppet] - 10https://gerrit.wikimedia.org/r/953653 [12:41:54] 10SRE, 10serviceops, 10CommRel-Specialists-Support (Jul-Sep-2023), 10Datacenter-Switchover: CommRel support for September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345265 (10kamila) [12:41:58] 10SRE, 10serviceops, 
10CommRel-Specialists-Support (Jul-Sep-2023), 10Datacenter-Switchover: CommRel support for September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345265 (10kamila) [12:42:20] !log cmooney@cumin1001 END (PASS) - Cookbook sre.network.provision (exit_code=0) for device ssw1-a1-codfw.mgmt.codfw.wmnet [12:42:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1213:3316 (T343718)', diff saved to https://phabricator.wikimedia.org/P52201 and previous config saved to /var/cache/conftool/dbconfig/20230831-124240-ladsgroup.json [12:42:46] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [12:42:56] 10SRE, 10SRE-Access-Requests: Requesting access to ops group for abran - https://phabricator.wikimedia.org/T345343 (10KOfori) This is approved. Thanks. [12:43:12] (03PS3) 10Jbond: confd: -prefix from confd cli to confd::file instances [puppet] - 10https://gerrit.wikimedia.org/r/954007 (https://phabricator.wikimedia.org/T341669) [12:43:53] (03CR) 10CI reject: [V: 04-1] Revert "ferm: add ensure support to the ferm class" [puppet] - 10https://gerrit.wikimedia.org/r/953653 (owner: 10Jbond) [12:44:24] (03PS1) 10Elukey: knative-serving: immediately clean up old revisions [deployment-charts] - 10https://gerrit.wikimedia.org/r/954047 (https://phabricator.wikimedia.org/T344058) [12:45:00] (03PS3) 10Jbond: Revert "ferm: add ensure support to the ferm class" [puppet] - 10https://gerrit.wikimedia.org/r/953653 [12:45:02] (03CR) 10CI reject: [V: 04-1] confd: -prefix from confd cli to confd::file instances [puppet] - 10https://gerrit.wikimedia.org/r/954007 (https://phabricator.wikimedia.org/T341669) (owner: 10Jbond) [12:46:56] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1044.eqiad.wmnet [12:47:06] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1045.eqiad.wmnet [12:47:27] !log mvernon@cumin2002 END (PASS) - Cookbook 
sre.hosts.reboot-single (exit_code=0) for host ms-be2044.codfw.wmnet [12:47:34] !log lucaswerkmeister-wmde Deployed security patch for T345064 [12:47:43] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2045.codfw.wmnet [12:48:14] (DatasourceError) resolved: (2) - https://alerts.wikimedia.org/?q=alertname%3DDatasourceError [12:48:44] (still deploying, wmf.24 now) [12:48:59] 10SRE, 10SRE-Access-Requests: Requesting access to ops group for abran - https://phabricator.wikimedia.org/T345343 (10jcrespo) [12:49:05] !log cmooney@cumin1001 START - Cookbook sre.network.tls for network device ssw1-a1-codfw [12:49:14] !log cmooney@cumin1001 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device ssw1-a1-codfw [12:49:28] !log cmooney@cumin1001 START - Cookbook sre.network.provision for device lsw1-a1-codfw.mgmt.codfw.wmnet [12:49:30] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [12:49:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P52202 and previous config saved to /var/cache/conftool/dbconfig/20230831-124934-ladsgroup.json [12:50:29] (03PS1) 10Arlolra: Use metrics from SiteConfig to restore the Parsoid prefix [extensions/VisualEditor] (wmf/1.41.0-wmf.23) - 10https://gerrit.wikimedia.org/r/954048 (https://phabricator.wikimedia.org/T339365) [12:50:59] (03PS1) 10Arlolra: Use metrics from SiteConfig to restore the Parsoid prefix [extensions/VisualEditor] (wmf/1.41.0-wmf.24) - 10https://gerrit.wikimedia.org/r/954049 (https://phabricator.wikimedia.org/T339365) [12:51:24] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:51:26] (03PS1) 10Anzx: tlywiki: Add logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/954050 (https://phabricator.wikimedia.org/T345316) [12:51:55] (03PS1) 
10Jbond: ferm: add ensure support to the ferm class [puppet] - 10https://gerrit.wikimedia.org/r/953654 (https://phabricator.wikimedia.org/T336497) [12:51:57] (03CR) 10Ilias Sarantopoulos: [C: 03+1] knative-serving: immediately clean up old revisions [deployment-charts] - 10https://gerrit.wikimedia.org/r/954047 (https://phabricator.wikimedia.org/T344058) (owner: 10Elukey) [12:52:13] (03CR) 10Elukey: [C: 03+2] knative-serving: immediately clean up old revisions [deployment-charts] - 10https://gerrit.wikimedia.org/r/954047 (https://phabricator.wikimedia.org/T344058) (owner: 10Elukey) [12:52:50] (03CR) 10JMeybohm: [C: 04-1] "Oh, I forgot: You will have to add something to mesh.networkpolicy as well, allowing the pods to egress to the otel collector." [deployment-charts] - 10https://gerrit.wikimedia.org/r/953576 (https://phabricator.wikimedia.org/T320563) (owner: 10Filippo Giunchedi) [12:52:54] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:53:01] 10SRE, 10SRE-Access-Requests: Requesting access to ops group for abran - https://phabricator.wikimedia.org/T345343 (10jcrespo) [12:53:16] (03Abandoned) 10Anzx: tlywiki: add logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953751 (https://phabricator.wikimedia.org/T345316) (owner: 10Anzx) [12:53:35] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for lsw1-a1-codfw - cmooney@cumin1001" [12:54:20] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1045.eqiad.wmnet [12:54:25] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for lsw1-a1-codfw - cmooney@cumin1001" [12:54:25] !log cmooney@cumin1001 END (PASS) - Cookbook 
sre.dns.netbox (exit_code=0) [12:54:53] (03CR) 10Stevemunene: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/950136 (https://phabricator.wikimedia.org/T342361) (owner: 10Gehel) [12:54:54] !log lucaswerkmeister-wmde Deployed security patch for T345064 [12:55:11] * Lucas_WMDE done [12:55:20] (and probably won’t be around for the backport window in a few minutes, I’m afraid) [12:55:49] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1046.eqiad.wmnet [12:55:52] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [12:57:31] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [12:57:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1213:3316', diff saved to https://phabricator.wikimedia.org/P52203 and previous config saved to /var/cache/conftool/dbconfig/20230831-125746-ladsgroup.json [12:58:57] (JobUnavailable) firing: (2) Reduced availability for job cloudlb_haproxy in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:59:31] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [13:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: It is that lovely time of the day again! You are hereby commanded to deploy UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230831T1300). [13:00:04] gmodena, Sohom_Datta, sergi0, and arlolra: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:15] hello [13:00:27] hey hey [13:00:38] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. 
[13:00:38] ^ joal [13:00:54] o/ [13:00:55] (03PS4) 10Jbond: confd: -prefix from confd cli to confd::file instances [puppet] - 10https://gerrit.wikimedia.org/r/954007 (https://phabricator.wikimedia.org/T341669) [13:00:57] Ack gmodena [13:02:19] 10SRE, 10SRE-Access-Requests: Requesting access to ops group for abran - https://phabricator.wikimedia.org/T345343 (10jcrespo) [13:02:42] !log Deployed refinery using scap, then deployed onto hdfs [13:02:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:04] (03CR) 10CI reject: [V: 04-1] confd: -prefix from confd cli to confd::file instances [puppet] - 10https://gerrit.wikimedia.org/r/954007 (https://phabricator.wikimedia.org/T341669) (owner: 10Jbond) [13:03:42] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2045.codfw.wmnet [13:04:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P52204 and previous config saved to /var/cache/conftool/dbconfig/20230831-130441-ladsgroup.json [13:05:17] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [13:06:14] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [13:06:43] (03PS5) 10Jbond: confd: -prefix from confd cli to confd::file instances [puppet] - 10https://gerrit.wikimedia.org/r/954007 (https://phabricator.wikimedia.org/T341669) [13:06:48] (03CR) 10Jbond: "pcc is still failing but its complete enough to take a look" [puppet] - 10https://gerrit.wikimedia.org/r/954007 (https://phabricator.wikimedia.org/T341669) (owner: 10Jbond) [13:07:16] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:07:45] Is any deployer around? 
I can deploy otherwise [13:08:00] (03CR) 10Muehlenhoff: [C: 03+2] Add testreduce1002 [puppet] - 10https://gerrit.wikimedia.org/r/954010 (https://phabricator.wikimedia.org/T345220) (owner: 10Muehlenhoff) [13:08:24] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1046.eqiad.wmnet [13:08:32] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [13:08:42] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:08:57] (JobUnavailable) firing: (3) Reduced availability for job cloudlb_haproxy in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:09:09] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [13:09:26] (03CR) 10CI reject: [V: 04-1] Use metrics from SiteConfig to restore the Parsoid prefix [extensions/VisualEditor] (wmf/1.41.0-wmf.24) - 10https://gerrit.wikimedia.org/r/954049 (https://phabricator.wikimedia.org/T339365) (owner: 10Arlolra) [13:10:15] 10SRE, 10SRE-Access-Requests: Requesting access to ops group for abran - https://phabricator.wikimedia.org/T345343 (10jcrespo) [13:10:21] (03CR) 10Arlolra: "recheck" [extensions/VisualEditor] (wmf/1.41.0-wmf.24) - 10https://gerrit.wikimedia.org/r/954049 (https://phabricator.wikimedia.org/T339365) (owner: 10Arlolra) [13:10:36] gmodena: do you need assistance for the backport? [13:11:38] sergi0 I should be able to test the change once it's deployed. [13:12:00] ok, starting with yours [13:12:09] sergi0 awesome, thanks [13:12:22] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. 
[13:12:30] (03CR) 10JHathaway: [C: 03+1] "looks good" [puppet] - 10https://gerrit.wikimedia.org/r/953960 (owner: 10Muehlenhoff) [13:12:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1213:3316', diff saved to https://phabricator.wikimedia.org/P52205 and previous config saved to /var/cache/conftool/dbconfig/20230831-131252-ladsgroup.json [13:13:05] (03PS6) 10Gehel: Start Blazegraph from systemd unit, without runBlazegraph.sh [puppet] - 10https://gerrit.wikimedia.org/r/950136 (https://phabricator.wikimedia.org/T342361) [13:13:16] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by sgimeno@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951929 (https://phabricator.wikimedia.org/T307959) (owner: 10Gmodena) [13:13:16] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [13:13:22] (03CR) 10Gehel: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/950136 (https://phabricator.wikimedia.org/T342361) (owner: 10Gehel) [13:13:26] (03PS1) 10Majavah: team-wmcs: Add CloudLB backend status checks [alerts] - 10https://gerrit.wikimedia.org/r/954052 (https://phabricator.wikimedia.org/T345294) [13:13:58] (03Merged) 10jenkins-bot: Remove rc1.mediawiki.page_content_change stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951929 (https://phabricator.wikimedia.org/T307959) (owner: 10Gmodena) [13:14:08] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. 
[13:14:28] !log sgimeno@deploy1002 Started scap: Backport for [[gerrit:951929|Remove rc1.mediawiki.page_content_change stream (T307959)]] [13:14:34] T307959: [Event Platform] Design and Implement realtime enrichment pipeline for MW page change with content - https://phabricator.wikimedia.org/T307959 [13:14:46] (03CR) 10CI reject: [V: 04-1] team-wmcs: Add CloudLB backend status checks [alerts] - 10https://gerrit.wikimedia.org/r/954052 (https://phabricator.wikimedia.org/T345294) (owner: 10Majavah) [13:14:52] (03PS1) 10Filippo Giunchedi: prometheus: fix gnmi relabel [puppet] - 10https://gerrit.wikimedia.org/r/954053 (https://phabricator.wikimedia.org/T326322) [13:15:00] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [13:15:29] (03PS2) 10Majavah: team-wmcs: Add CloudLB backend status checks [alerts] - 10https://gerrit.wikimedia.org/r/954052 (https://phabricator.wikimedia.org/T345294) [13:15:50] 10ops-codfw: Duplicate IP on mgmt network - https://phabricator.wikimedia.org/T345266 (10phaultfinder) [13:16:03] !log sgimeno@deploy1002 gmodena and sgimeno: Backport for [[gerrit:951929|Remove rc1.mediawiki.page_content_change stream (T307959)]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [13:16:29] gmodena: you can test the change in debug server [13:16:35] sergi0 ack [13:16:43] (03CR) 10Ayounsi: [C: 03+1] prometheus: fix gnmi relabel [puppet] - 10https://gerrit.wikimedia.org/r/954053 (https://phabricator.wikimedia.org/T326322) (owner: 10Filippo Giunchedi) [13:16:46] 10SRE, 10SRE-Access-Requests: Requesting access to ops group for abran - https://phabricator.wikimedia.org/T345343 (10jcrespo) @joanna_borun Asking for sign up of @Arnaud for global root production access as a new member of Data Persistence Team, as you are one of the people being able to approve that. Thank you! 
[13:17:14] sergi0 everything works as expected. [13:17:18] 10ops-codfw: Duplicate IP on mgmt network - https://phabricator.wikimedia.org/T345266 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm issue was resolved [13:17:28] !log sgimeno@deploy1002 gmodena and sgimeno: Continuing with sync [13:17:34] syncing [13:17:34] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2046.codfw.wmnet [13:17:44] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1047.eqiad.wmnet [13:18:38] (KubernetesAPILatency) firing: (4) High Kubernetes API latency (PATCH configmaps) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:18:46] Sohom_Datta: your patch is next, are you around? [13:19:05] yep yep [13:19:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T343718)', diff saved to https://phabricator.wikimedia.org/P52206 and previous config saved to /var/cache/conftool/dbconfig/20230831-131947-ladsgroup.json [13:19:49] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2169.codfw.wmnet with reason: Maintenance [13:19:53] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [13:20:03] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2169.codfw.wmnet with reason: Maintenance [13:20:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2169:3316 (T343718)', diff saved to https://phabricator.wikimedia.org/P52207 and previous config saved to /var/cache/conftool/dbconfig/20230831-132009-ladsgroup.json [13:20:15] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host wdqs1010.eqiad.wmnet with OS bullseye [13:20:29] (03CR) 10Jon Harald Søby: "This is not something that's wrong with the patch 
per se, but the most active contributor asked us to change the logo from "Vikipediya" to" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953652 (https://phabricator.wikimedia.org/T345316) (owner: 10Anzx) [13:20:43] sergi0 many thanks for the help [13:21:18] gmodena: your patch is still syncing [13:21:31] sergi0 ack [13:22:08] (03PS1) 10Elukey: knative-serving: increase failure-threshold for the webhook pod [deployment-charts] - 10https://gerrit.wikimedia.org/r/954054 [13:22:15] (03PS4) 10Anzx: tlywiki: add metanamespace , timezone, sitename [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953652 (https://phabricator.wikimedia.org/T345316) [13:22:26] (03PS38) 10David Caro: replica_cnf_api: add envvars backend [puppet] - 10https://gerrit.wikimedia.org/r/936232 (https://phabricator.wikimedia.org/T265691) [13:22:52] (03PS2) 10Elukey: knative-serving: increase failure-threshold for the webhook pod [deployment-charts] - 10https://gerrit.wikimedia.org/r/954054 [13:23:04] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1047.eqiad.wmnet [13:23:11] (03CR) 10Anzx: tlywiki: add metanamespace , timezone, sitename (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953652 (https://phabricator.wikimedia.org/T345316) (owner: 10Anzx) [13:23:25] (03PS3) 10Elukey: knative-serving: increase failure-threshold for the webhook pod [deployment-charts] - 10https://gerrit.wikimedia.org/r/954054 [13:23:38] (KubernetesAPILatency) firing: (11) High Kubernetes API latency (DELETE pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:23:52] (03PS6) 10Jbond: confd: -prefix from confd cli to confd::file instances [puppet] - 10https://gerrit.wikimedia.org/r/954007 (https://phabricator.wikimedia.org/T341669) [13:24:09] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2046.codfw.wmnet 
[13:24:38] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2047.codfw.wmnet [13:24:42] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1048.eqiad.wmnet [13:25:01] !log sgimeno@deploy1002 Finished scap: Backport for [[gerrit:951929|Remove rc1.mediawiki.page_content_change stream (T307959)]] (duration: 10m 33s) [13:25:09] T307959: [Event Platform] Design and Implement realtime enrichment pipeline for MW page change with content - https://phabricator.wikimedia.org/T307959 [13:25:17] gmodena: the change is live [13:25:30] sergi0 ack. All looks good. [13:25:33] thanks again [13:25:33] !log cmooney@cumin1001 END (PASS) - Cookbook sre.network.provision (exit_code=0) for device lsw1-a1-codfw.mgmt.codfw.wmnet [13:25:52] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by sgimeno@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/952928 (https://phabricator.wikimedia.org/T308098) (owner: 10Sohom Datta) [13:25:54] (03CR) 10CI reject: [V: 04-1] replica_cnf_api: add envvars backend [puppet] - 10https://gerrit.wikimedia.org/r/936232 (https://phabricator.wikimedia.org/T265691) (owner: 10David Caro) [13:26:37] you are welcome [13:26:37] (03PS2) 10Filippo Giunchedi: prometheus: fix gnmi relabel [puppet] - 10https://gerrit.wikimedia.org/r/954053 (https://phabricator.wikimedia.org/T326322) [13:26:47] (03CR) 10Elukey: [C: 03+2] knative-serving: increase failure-threshold for the webhook pod [deployment-charts] - 10https://gerrit.wikimedia.org/r/954054 (owner: 10Elukey) [13:27:15] (03Merged) 10jenkins-bot: Allow loading Edit-in-Sequence as a beta feature on Wikisources [mediawiki-config] - 10https://gerrit.wikimedia.org/r/952928 (https://phabricator.wikimedia.org/T308098) (owner: 10Sohom Datta) [13:27:33] !log sgimeno@deploy1002 Started scap: Backport for [[gerrit:952928|Allow loading Edit-in-Sequence as a beta feature on Wikisources (T308098)]] [13:27:39] T308098: Integrate edit-in-sequence 
inside ProofreadPage - https://phabricator.wikimedia.org/T308098 [13:27:55] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 13 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43086/console" [puppet] - 10https://gerrit.wikimedia.org/r/954007 (https://phabricator.wikimedia.org/T341669) (owner: 10Jbond) [13:27:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1213:3316 (T343718)', diff saved to https://phabricator.wikimedia.org/P52208 and previous config saved to /var/cache/conftool/dbconfig/20230831-132759-ladsgroup.json [13:28:01] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1224.eqiad.wmnet with reason: Maintenance [13:28:02] !log marostegui@cumin1001 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db1132.eqiad.wmnet onto db1119.eqiad.wmnet [13:28:05] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [13:28:14] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1224.eqiad.wmnet with reason: Maintenance [13:28:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1224 (T343718)', diff saved to https://phabricator.wikimedia.org/P52209 and previous config saved to /var/cache/conftool/dbconfig/20230831-132820-ladsgroup.json [13:28:23] PROBLEM - MariaDB Replica Lag: s1 #page on db1132 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 6039.45 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:28:34] hello [13:28:38] (KubernetesAPILatency) resolved: (11) High Kubernetes API latency (DELETE pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:28:41] around [13:28:45] * Emperor here [13:28:46] depool? [13:28:57] here too [13:29:02] I'm around now [13:29:04] let me check [13:29:06] expired downtime? 
https://sal.toolforge.org/production?p=0&q=db1132&d= [13:29:06] Amir1: thanks [13:29:11] !log sgimeno@deploy1002 sgimeno and soda: Backport for [[gerrit:952928|Allow loading Edit-in-Sequence as a beta feature on Wikisources (T308098)]] synced to the testservers mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [13:29:19] first depool [13:29:26] doing [13:29:28] PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:29:36] it's not pooled in the first place [13:29:56] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [13:29:58] from sal? [13:30:02] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2047.codfw.wmnet [13:30:14] at least it's not visible on https://noc.wikimedia.org/db.php [13:30:22] and SAL shows marostegui doing maintenance on it earlier today [13:30:28] Sohom_Datta: you can test [13:30:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1224 (T343718)', diff saved to https://phabricator.wikimedia.org/P52210 and previous config saved to /var/cache/conftool/dbconfig/20230831-133029-ladsgroup.json [13:30:37] nothing to commit [13:30:37] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: fix gnmi relabel [puppet] - 10https://gerrit.wikimedia.org/r/954053 (https://phabricator.wikimedia.org/T326322) (owner: 10Filippo Giunchedi) [13:30:44] good, it's not pooled [13:30:50] (03PS7) 10Jbond: confd: -prefix from confd cli to confd::file instances [puppet] - 10https://gerrit.wikimedia.org/r/954007 (https://phabricator.wikimedia.org/T341669) [13:30:55] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. 
[13:31:04] !log ayounsi@cumin1001 START - Cookbook sre.network.tls for network device ssw1-e1-eqiad [13:31:15] the downtime should be for 24 hours [13:31:27] On it [13:31:40] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1010.eqiad.wmnet with reason: host reimage [13:31:44] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:32:11] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host doh2002.wikimedia.org with OS bookworm [13:32:18] RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:32:21] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host doh2002.wikimedia.org with OS bookworm [13:32:42] sigh, I hate this thing with cookbooks [13:32:45] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [13:32:50] it downtimed the host for 48 hours [13:32:58] but removed the downtime once the clone was done [13:33:06] (03CR) 10CI reject: [V: 04-1] confd: -prefix from confd cli to confd::file instances [puppet] - 10https://gerrit.wikimedia.org/r/954007 (https://phabricator.wikimedia.org/T341669) (owner: 10Jbond) [13:33:07] it has happened before [13:33:19] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device ssw1-e1-eqiad [13:33:27] Amir1: which cookbook? [13:33:33] (in another cookbook) [13:33:40] clone cookbook and upgrade [13:33:43] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. 
[13:33:52] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-cluster (exit_code=0) [13:33:56] with self.alerting_hosts(hosts_to_downtime).downtimed(self.admin_reason, duration=timedelta(hours=48)): [13:34:10] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:34:19] sergi0: Looks good :) [13:34:26] Amir1: you can wait for icinga being optimal [13:34:27] Tested on enwikisource [13:34:28] !log ayounsi@cumin1001 START - Cookbook sre.network.tls for network device ssw1-f1-eqiad [13:34:49] !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-cluster [13:34:56] syncing [13:35:00] !log sgimeno@deploy1002 sgimeno and soda: Continuing with sync [13:35:02] or add any other check before exiting the context manager [13:35:31] !log ayounsi@cumin1001 END (ERROR) - Cookbook sre.network.tls (exit_code=97) for network device ssw1-f1-eqiad [13:35:42] !log swap puppetdb-api and puppetdb-api-next gerrit:940384 [13:35:43] (03PS8) 10Jbond: confd: -prefix from confd cli to confd::file instances [puppet] - 10https://gerrit.wikimedia.org/r/954007 (https://phabricator.wikimedia.org/T341669) [13:35:48] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv6: Connect - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:35:48] (03CR) 10Jbond: [C: 03+2] puppetdb-api: swap the production and next environments [puppet] - 10https://gerrit.wikimedia.org/r/940384 (https://phabricator.wikimedia.org/T342214) (owner: 10Jbond) [13:35:54] !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on wdqs1010.eqiad.wmnet with reason: host reimage [13:35:57] volans: any way to tell it not remove the downtime? 
[13:36:04] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:36:12] PROBLEM - BFD status on cr1-codfw is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:36:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:31] what's the problem? you can 1) check for icinga optimal before exiting the context manager so that when it exits icinga is all green [13:36:33] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1048.eqiad.wmnet [13:36:38] (KubernetesAPILatency) firing: (11) High Kubernetes API latency (PATCH configmaps) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:36:50] 2) don't use the context manager and just set the downtime, paying the price it will be downtimed for longer [13:36:50] PROBLEM - BFD status on cr2-codfw is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:37:03] 3) add any custom check to ensure your host is happy before removing the downtime [13:37:41] unfortunately there is no concept of "all optimal" in the alertmanager world [13:37:46] (Not accepting/receiving prefixes from anycast BGP peer) resolved: (2) Device cr1-codfw.wikimedia.org recovered from Not accepting/receiving prefixes from anycast BGP peer - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer [13:37:55] (03PS1) 10Filippo Giunchedi: prometheus: drop 'cluster' for gnmi job [puppet] - 10https://gerrit.wikimedia.org/r/954055 (https://phabricator.wikimedia.org/T326322) [13:38:12] (03CR) 10AOkoth: vrts: apply role and setup hiera values (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/953631 (https://phabricator.wikimedia.org/T340027) (owner: 10AOkoth) [13:38:16] I'll go with the second option [13:38:23] why not 1? [13:38:29] it's 2 lines of code [13:38:34] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2048.codfw.wmnet [13:38:38] (03PS9) 10Jbond: confd: -prefix from confd cli to confd::file instances [puppet] - 10https://gerrit.wikimedia.org/r/954007 (https://phabricator.wikimedia.org/T341669) [13:38:38] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1049.eqiad.wmnet [13:38:44] if the alert comes from icinga [13:38:53] (KubernetesAPILatency) resolved: (12) High Kubernetes API latency (PATCH configmaps) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:39:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316 (T343718)', diff saved to https://phabricator.wikimedia.org/P52211 and previous config saved to /var/cache/conftool/dbconfig/20230831-133905-ladsgroup.json [13:39:08] because it could take even a day for the replica to catch up [13:39:11] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [13:39:43] (03CR) 10David Caro: [V: 03+1 C: 03+2] "\o/ working on tools also:" [puppet] - 10https://gerrit.wikimedia.org/r/936232 (https://phabricator.wikimedia.org/T265691) (owner: 10David Caro) [13:39:55] Downtime expired [13:40:20] ack [13:40:33] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: drop 'cluster' for gnmi job [puppet] - 10https://gerrit.wikimedia.org/r/954055 (https://phabricator.wikimedia.org/T326322) (owner: 10Filippo Giunchedi) [13:40:39] (03CR) 10David Caro: "Hmm... 
those errors seem unrelated :/" [puppet] - 10https://gerrit.wikimedia.org/r/936232 (https://phabricator.wikimedia.org/T265691) (owner: 10David Caro) [13:40:48] (03PS1) 10FNegri: [openstack] upgrade codfw1dev to Antelope (2023.1) [puppet] - 10https://gerrit.wikimedia.org/r/954056 (https://phabricator.wikimedia.org/T341285) [13:40:49] jbond: merging your change too [13:40:57] actually I downtimed this host for 24h [13:40:59] Why did it page? [13:41:17] RECOVERY - MariaDB Replica Lag: s1 #page on db1132 is OK: OK slave_sql_lag Replication lag: 26.65 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:41:35] marostegui: because the cookbook removes the downtime [13:41:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 1%: Repooling after cloning another host', diff saved to https://phabricator.wikimedia.org/P52212 and previous config saved to /var/cache/conftool/dbconfig/20230831-134136-root.json [13:41:38] (KubernetesAPILatency) firing: (13) High Kubernetes API latency (DELETE pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:41:46] Amir1: aaaah ok! [13:41:56] does anyone know if and what action needs to be taken when 1 proxies had sync errors during scap? [13:42:34] !log sgimeno@deploy1002 Finished scap: Backport for [[gerrit:952928|Allow loading Edit-in-Sequence as a beta feature on Wikisources (T308098)]] (duration: 15m 00s) [13:42:39] T308098: Integrate edit-in-sequence inside ProofreadPage - https://phabricator.wikimedia.org/T308098 [13:42:52] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db12[34-49] - https://phabricator.wikimedia.org/T342166 (10Jclark-ctr) @wiki_willy @Marostegui @RobH can we get some clarification on racking. ticket list Speed:1G Vlan. but came with 10g cards and on procurement doc list 10g.... 
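Note on the db1132 page discussed above: the clone cookbook asked for a 48-hour downtime, but the `with ... downtimed(...)` block lifted it as soon as the block exited, while the replica was still catching up. The rough Python sketch below illustrates that behaviour and the options raised in the channel. `FakeAlerting`, `wait_until`, the function names, and the lag samples are hypothetical stand-ins for illustration only; this is not the Spicerack API or the actual cookbook code, and only the general shape of the `downtimed(...)` call is taken from the log.

```python
from contextlib import contextmanager
from datetime import timedelta


class FakeAlerting:
    """Hypothetical stand-in for an alerting backend (not the real Spicerack API)."""

    def __init__(self):
        self.downtimes = {}

    def set_downtime(self, host, reason, duration):
        self.downtimes[host] = (reason, duration)

    def remove_downtime(self, host):
        self.downtimes.pop(host, None)

    @contextmanager
    def downtimed(self, host, reason, duration):
        self.set_downtime(host, reason, duration)
        try:
            yield
        finally:
            # Removed on exit, regardless of the requested duration.
            self.remove_downtime(host)


def wait_until(check, attempts):
    """Poll check() up to `attempts` times (a real loop would sleep between polls)."""
    for _ in range(attempts):
        if check():
            return True
    return False


def clone_waiting_for_check(alerting, host, lag_samples, max_lag=60.0):
    """Options 1 and 3: stay inside the context manager until a check passes."""
    samples = iter(lag_samples)
    with alerting.downtimed(host, "clone", timedelta(hours=48)):
        # ... the clone itself would run here ...
        return wait_until(
            lambda: next(samples, float("inf")) <= max_lag,
            attempts=len(lag_samples),
        )


def clone_with_plain_downtime(alerting, host):
    """Option 2: set the downtime directly and let it expire on its own."""
    alerting.set_downtime(host, "clone", timedelta(hours=24))
    # ... the clone itself would run here; nothing removes the downtime ...


alerting = FakeAlerting()
caught_up = clone_waiting_for_check(alerting, "db1132", [6039.45, 900.0, 26.65])
clone_with_plain_downtime(alerting, "db1119")
```

In this sketch the first variant only leaves the block once the simulated lag drops below the threshold, which is the trade-off raised in the channel: a replica may need up to a day to catch up, so the cookbook would block for that long. The second variant returns immediately and leaves the host downtimed for the full duration.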
[13:43:53] (KubernetesAPILatency) resolved: (13) High Kubernetes API latency (DELETE pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:44:09] godog: ack thanks [13:44:10] sergi0: That issue probably is bad timing between the appserver reboots we're doing and deployment, was it a codfw host? [13:44:18] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 13 CORE_DIFF 6 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43089/console" [puppet] - 10https://gerrit.wikimedia.org/r/954007 (https://phabricator.wikimedia.org/T341669) (owner: 10Jbond) [13:44:34] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2048.codfw.wmnet [13:44:36] (03CR) 10Jbond: "Latest pcc i think looks good, it has some differences but i think that's the change you want" [puppet] - 10https://gerrit.wikimedia.org/r/954007 (https://phabricator.wikimedia.org/T341669) (owner: 10Jbond) [13:44:37] claime: mw2259.codfw.wmnet indeed [13:44:46] (03PS1) 10Arnaudb: adding arnaudb to proper groups [puppet] - 10https://gerrit.wikimedia.org/r/953491 [13:44:52] sergi0: yeah, it's just been rebooted [13:45:15] claime: the backport process ended failing though, how do I proceed with this? [13:45:20] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1049.eqiad.wmnet [13:45:32] (03CR) 10CI reject: [V: 04-1] adding arnaudb to proper groups [puppet] - 10https://gerrit.wikimedia.org/r/953491 (owner: 10Arnaudb) [13:45:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1224', diff saved to https://phabricator.wikimedia.org/P52213 and previous config saved to /var/cache/conftool/dbconfig/20230831-134535-ladsgroup.json [13:46:01] Well it's back up now, so I guess you can redo the backport, but I'd like someone that knows more about the deployment process than me to weigh in, Amir1 ? 
[13:46:06] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2049.codfw.wmnet [13:46:10] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1050.eqiad.wmnet [13:46:22] yeah, just redo the backport [13:46:41] alright, thanks [13:46:43] (SystemdUnitFailed) firing: (7) elasticsearch-disable-readahead.service Failed on elastic2038:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:47:05] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on doh2002.wikimedia.org with reason: host reimage [13:47:08] !log sgimeno@deploy1002 Started scap: Backport for [[gerrit:952928|Allow loading Edit-in-Sequence as a beta feature on Wikisources (T308098)]] [13:47:19] inflatador, ryankemper: would you have time to look into the readahead failure above? ^^^ [13:48:20] !log jmm@cumin2002 START - Cookbook sre.ganeti.resource-report [13:48:20] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.resource-report (exit_code=0) [13:48:41] (03PS2) 10EoghanGaffney: gitlab: Remove swift configs and return gitlab1003 to restore group [puppet] - 10https://gerrit.wikimedia.org/r/953193 [13:48:46] !log sgimeno@deploy1002 soda and sgimeno: Backport for [[gerrit:952928|Allow loading Edit-in-Sequence as a beta feature on Wikisources (T308098)]] synced to the testservers mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [13:48:51] T308098: Integrate edit-in-sequence inside ProofreadPage - https://phabricator.wikimedia.org/T308098 [13:49:08] !log sgimeno@deploy1002 soda and sgimeno: Continuing with sync [13:49:09] !log ayounsi@cumin1001 START - Cookbook sre.network.debug for Netbox circuit ID 173 [13:49:35] !log ayounsi@cumin1001 END (PASS) - Cookbook 
sre.network.debug (exit_code=0) for Netbox circuit ID 173 [13:50:04] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh2002.wikimedia.org with reason: host reimage [13:50:17] (03CR) 10EoghanGaffney: [V: 03+1] "PCC SUCCESS (NOOP 2 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43090/console" [puppet] - 10https://gerrit.wikimedia.org/r/953193 (owner: 10EoghanGaffney) [13:51:43] (SystemdUnitFailed) firing: (7) elasticsearch-disable-readahead.service Failed on elastic2038:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:51:56] gehel :eyes [13:52:37] (03PS1) 10Ladsgroup: mysql: Stop removing the downtime after clone is done [cookbooks] - 10https://gerrit.wikimedia.org/r/954059 [13:52:40] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [13:52:59] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db12[34-49] - https://phabricator.wikimedia.org/T342166 (10Marostegui) We order them with 10G just in case, but we only use the 1G one. [13:53:21] (03PS3) 10EoghanGaffney: gitlab: Remove swift configs and return gitlab1003 to restore group [puppet] - 10https://gerrit.wikimedia.org/r/953193 [13:53:38] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host testreduce1002.eqiad.wmnet [13:53:40] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [13:53:55] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2049.codfw.wmnet [13:54:00] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. 
[13:54:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316', diff saved to https://phabricator.wikimedia.org/P52214 and previous config saved to /var/cache/conftool/dbconfig/20230831-135411-ladsgroup.json [13:54:27] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2050.codfw.wmnet [13:54:52] (03PS4) 10EoghanGaffney: gitlab: Remove swift configs and return gitlab1003 to restore group [puppet] - 10https://gerrit.wikimedia.org/r/953193 [13:55:18] (03CR) 10Muehlenhoff: adding arnaudb to proper groups (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/953491 (owner: 10Arnaudb) [13:56:01] (03CR) 10EoghanGaffney: [V: 03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43091/console" [puppet] - 10https://gerrit.wikimedia.org/r/953193 (owner: 10EoghanGaffney) [13:56:04] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM testreduce1002.eqiad.wmnet - jmm@cumin2002" [13:56:24] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Add per-output queue monitoring for Juniper network devices - https://phabricator.wikimedia.org/T326322 (10ayounsi) We have data https://grafana.wikimedia.org/d/iUATvNzSz/network-queues ! And a doc: https://wikitech.wikimedia.org/wiki/Netwo... 
[13:56:30] (03PS2) 10Arnaudb: admin [puppet] - 10https://gerrit.wikimedia.org/r/953491 (https://phabricator.wikimedia.org/T345343) [13:56:42] !log sgimeno@deploy1002 Finished scap: Backport for [[gerrit:952928|Allow loading Edit-in-Sequence as a beta feature on Wikisources (T308098)]] (duration: 09m 33s) [13:56:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 3%: Repooling after cloning another host', diff saved to https://phabricator.wikimedia.org/P52215 and previous config saved to /var/cache/conftool/dbconfig/20230831-135641-root.json [13:56:47] T308098: Integrate edit-in-sequence inside ProofreadPage - https://phabricator.wikimedia.org/T308098 [13:56:49] 10ops-codfw: Duplicate IP on mgmt network - https://phabricator.wikimedia.org/T345356 (10phaultfinder) [13:56:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM testreduce1002.eqiad.wmnet - jmm@cumin2002" [13:56:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:56:53] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache testreduce1002.eqiad.wmnet on all recursors [13:56:57] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) testreduce1002.eqiad.wmnet on all recursors [13:57:02] Amir1: 2 (other) hosts had scap-cdb-rebuild errors. Should I redo the backport again? 
Or on the contrary pause until hosts are back [13:57:24] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM testreduce1002.eqiad.wmnet - jmm@cumin2002" [13:57:25] (03CR) 10CI reject: [V: 04-1] admin [puppet] - 10https://gerrit.wikimedia.org/r/953491 (https://phabricator.wikimedia.org/T345343) (owner: 10Arnaudb) [13:58:12] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM testreduce1002.eqiad.wmnet - jmm@cumin2002" [13:58:13] (03CR) 10Marostegui: [C: 03+1] mysql: Stop removing the downtime after clone is done [cookbooks] - 10https://gerrit.wikimedia.org/r/954059 (owner: 10Ladsgroup) [13:58:15] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1050.eqiad.wmnet [13:58:25] (03PS3) 10Arnaudb: admin: Add arnaudb to root user group As part of his onboarding we have arnaudb doing the modifications and asked him to remove his modifications [puppet] - 10https://gerrit.wikimedia.org/r/953491 (https://phabricator.wikimedia.org/T345343) [13:58:52] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1051.eqiad.wmnet [13:58:57] (JobUnavailable) firing: (2) Reduced availability for job nutcracker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:59:05] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. 
[13:59:10] retrying [13:59:11] (03CR) 10CI reject: [V: 04-1] admin: Add arnaudb to root user group As part of his onboarding we have arnaudb doing the modifications and asked him to remove his modifications [puppet] - 10https://gerrit.wikimedia.org/r/953491 (https://phabricator.wikimedia.org/T345343) (owner: 10Arnaudb) [13:59:41] !log sgimeno@deploy1002 Started scap: Backport for [[gerrit:952928|Allow loading Edit-in-Sequence as a beta feature on Wikisources (T308098)]] [13:59:48] (03PS1) 10Gehel: java: introduce a standard list of GC logging options for Java 8 [puppet] - 10https://gerrit.wikimedia.org/r/954060 (https://phabricator.wikimedia.org/T345355) [13:59:50] (03PS1) 10Gehel: query_service: use the standard GC logging options [puppet] - 10https://gerrit.wikimedia.org/r/954061 (https://phabricator.wikimedia.org/T345355) [14:00:01] (JobUnavailable) firing: (2) Reduced availability for job nutcracker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:00:13] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host testreduce1002.eqiad.wmnet with OS bookworm [14:00:20] (03CR) 10CI reject: [V: 04-1] java: introduce a standard list of GC logging options for Java 8 [puppet] - 10https://gerrit.wikimedia.org/r/954060 (https://phabricator.wikimedia.org/T345355) (owner: 10Gehel) [14:00:35] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2050.codfw.wmnet [14:00:41] (03PS4) 10Arnaudb: admin: Add arnaudb to root user group [puppet] - 10https://gerrit.wikimedia.org/r/953491 (https://phabricator.wikimedia.org/T345343) [14:00:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1224', diff saved to https://phabricator.wikimedia.org/P52216 and previous config saved to /var/cache/conftool/dbconfig/20230831-140041-ladsgroup.json 
[14:01:09] (03CR) 10Arnaudb: admin: Add arnaudb to root user group (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/953491 (https://phabricator.wikimedia.org/T345343) (owner: 10Arnaudb) [14:01:10] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [14:01:17] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2051.codfw.wmnet [14:01:19] !log sgimeno@deploy1002 sgimeno and soda: Backport for [[gerrit:952928|Allow loading Edit-in-Sequence as a beta feature on Wikisources (T308098)]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [14:01:24] PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:01:24] !log sgimeno@deploy1002 sgimeno and soda: Continuing with sync [14:01:27] (03CR) 10CI reject: [V: 04-1] admin: Add arnaudb to root user group [puppet] - 10https://gerrit.wikimedia.org/r/953491 (https://phabricator.wikimedia.org/T345343) (owner: 10Arnaudb) [14:02:06] (CirrusSearchNodeIndexingNotIncreasing) firing: Elasticsearch instance elastic2038-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [14:03:02] (03PS1) 10Hashar: gitlab: add project_features parameter to profile [puppet] - 10https://gerrit.wikimedia.org/r/954063 (https://phabricator.wikimedia.org/T264231) [14:03:04] (03PS1) 10Hashar: gitlab: disable issue tracker by default on devtools [puppet] - 10https://gerrit.wikimedia.org/r/954064 (https://phabricator.wikimedia.org/T264231) 
[14:03:06] (03PS1) 10Hashar: gitlab: disable issue tracker by default on production [puppet] - 10https://gerrit.wikimedia.org/r/954065 (https://phabricator.wikimedia.org/T264231) [14:04:20] RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:04:44] RECOVERY - Check systemd state on elastic2038 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:05:06] (03PS5) 10Arnaudb: admin: Add arnaudb to root user group [puppet] - 10https://gerrit.wikimedia.org/r/953491 (https://phabricator.wikimedia.org/T345343) [14:05:43] !log bking@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - bking@cumin1001" [14:05:44] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1051.eqiad.wmnet [14:06:22] (03CR) 10Hashar: [C: 03+1] "I have cherry picked it on the devtools Puppet master." [puppet] - 10https://gerrit.wikimedia.org/r/954063 (https://phabricator.wikimedia.org/T264231) (owner: 10Hashar) [14:06:36] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1052.eqiad.wmnet [14:06:42] (03CR) 10Arturo Borrero Gonzalez: "LGTM, but I would like to see a PCC for cloudgw at least." 
[puppet] - 10https://gerrit.wikimedia.org/r/953654 (https://phabricator.wikimedia.org/T336497) (owner: 10Jbond) [14:06:43] (SystemdUnitFailed) resolved: (3) elasticsearch-disable-readahead.service Failed on elastic2038:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:06:57] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2051.codfw.wmnet [14:07:18] !log sgimeno@deploy1002 Finished scap: Backport for [[gerrit:952928|Allow loading Edit-in-Sequence as a beta feature on Wikisources (T308098)]] (duration: 07m 36s) [14:07:23] T308098: Integrate edit-in-sequence inside ProofreadPage - https://phabricator.wikimedia.org/T308098 [14:07:31] yay! [14:07:32] Sohom_Datta: your change is finally live [14:07:35] :) [14:07:35] (KubernetesAPILatency) firing: (8) High Kubernetes API latency (PUT deployments) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:07:40] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2052.codfw.wmnet [14:07:41] going for mine [14:07:45] (03CR) 10Jon Harald Søby: "The wordmark still looks weird. I uploaded a new version of it to Commons now; could you update that one here as well?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/954050 (https://phabricator.wikimedia.org/T345316) (owner: 10Anzx) [14:07:52] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by sgimeno@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/954004 (https://phabricator.wikimedia.org/T308138) (owner: 10Sergio Gimeno) [14:08:37] (03Merged) 10jenkins-bot: GrowthExperiments: enable AddLink backend for swwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/954004 (https://phabricator.wikimedia.org/T308138) (owner: 10Sergio Gimeno) [14:08:42] (03CR) 10Hashar: [C: 04-1] "I have misread how the template is expanded. It requires all features to be listed in order to enable them when they are enabled by defa" [puppet] - 10https://gerrit.wikimedia.org/r/954063 (https://phabricator.wikimedia.org/T264231) (owner: 10Hashar) [14:08:57] (JobUnavailable) firing: (3) Reduced availability for job nutcracker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:09:02] !log sgimeno@deploy1002 Started scap: Backport for [[gerrit:954004|GrowthExperiments: enable AddLink backend for swwiki (T308138 T308139)]] [14:09:10] T308138: Deploy "add a link" to 13th round of wikis - https://phabricator.wikimedia.org/T308138 [14:09:10] T308139: Deploy "add a link" to 14th round of wikis - https://phabricator.wikimedia.org/T308139 [14:09:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316', diff saved to https://phabricator.wikimedia.org/P52217 and previous config saved to /var/cache/conftool/dbconfig/20230831-140917-ladsgroup.json [14:10:14] (03CR) 10Volans: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/953990 (https://phabricator.wikimedia.org/T341496) (owner: 10Jbond) [14:10:43] !log sgimeno@deploy1002 sgimeno: Backport for 
[[gerrit:954004|GrowthExperiments: enable AddLink backend for swwiki (T308138 T308139)]] synced to the testservers mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [14:10:58] !log sgimeno@deploy1002 sgimeno: Continuing with sync [14:11:25] (03PS39) 10David Caro: replica_cnf_api: add envvars backend [puppet] - 10https://gerrit.wikimedia.org/r/936232 (https://phabricator.wikimedia.org/T265691) [14:11:27] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "thanks!" [alerts] - 10https://gerrit.wikimedia.org/r/954052 (https://phabricator.wikimedia.org/T345294) (owner: 10Majavah) [14:11:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 5%: Repooling after cloning another host', diff saved to https://phabricator.wikimedia.org/P52219 and previous config saved to /var/cache/conftool/dbconfig/20230831-141146-root.json [14:11:54] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on testreduce1002.eqiad.wmnet with reason: host reimage [14:12:16] (03CR) 10Hnowlan: [C: 03+2] service: add media-analytics service entry [puppet] - 10https://gerrit.wikimedia.org/r/951901 (https://phabricator.wikimedia.org/T336380) (owner: 10Hnowlan) [14:12:35] (KubernetesAPILatency) resolved: (8) High Kubernetes API latency (PUT deployments) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:12:39] (03PS1) 10Ayounsi: gNMIc: add interface description as metrics tag [puppet] - 10https://gerrit.wikimedia.org/r/954066 (https://phabricator.wikimedia.org/T326322) [14:13:05] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1052.eqiad.wmnet [14:13:35] (03PS6) 10Arnaudb: admin: Add arnaudb to root user group [puppet] - 
10https://gerrit.wikimedia.org/r/953491 (https://phabricator.wikimedia.org/T345343) [14:14:35] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2052.codfw.wmnet [14:14:46] (03CR) 10CI reject: [V: 04-1] replica_cnf_api: add envvars backend [puppet] - 10https://gerrit.wikimedia.org/r/936232 (https://phabricator.wikimedia.org/T265691) (owner: 10David Caro) [14:15:41] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2053.codfw.wmnet [14:15:44] RECOVERY - BFD status on cr1-codfw is OK: UP: 22 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:15:45] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1053.eqiad.wmnet [14:15:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1224 (T343718)', diff saved to https://phabricator.wikimedia.org/P52220 and previous config saved to /var/cache/conftool/dbconfig/20230831-141547-ladsgroup.json [14:15:50] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1225.eqiad.wmnet with reason: Maintenance [14:15:52] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1225.eqiad.wmnet with reason: Maintenance [14:15:52] (03CR) 10Muehlenhoff: [C: 03+2] networking fact: Remove check for stretch [puppet] - 10https://gerrit.wikimedia.org/r/953960 (owner: 10Muehlenhoff) [14:16:02] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [14:16:22] RECOVERY - BFD status on cr2-codfw is OK: UP: 20 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:16:22] (03PS7) 10Arnaudb: admin: Add arnaudb to root user group [puppet] - 10https://gerrit.wikimedia.org/r/953491 (https://phabricator.wikimedia.org/T345343) [14:16:37] !log sgimeno@deploy1002 Finished scap: Backport for [[gerrit:954004|GrowthExperiments: enable AddLink 
backend for swwiki (T308138 T308139)]] (duration: 07m 34s) [14:16:40] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 115, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:16:44] T308138: Deploy "add a link" to 13th round of wikis - https://phabricator.wikimedia.org/T308138 [14:16:44] T308139: Deploy "add a link" to 14th round of wikis - https://phabricator.wikimedia.org/T308139 [14:16:46] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on testreduce1002.eqiad.wmnet with reason: host reimage [14:17:36] (03PS6) 10Sergio Gimeno: GrowthExperiments: enable add a link in 12th round of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/948144 (https://phabricator.wikimedia.org/T308137) [14:17:43] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doh2002.wikimedia.org with OS bookworm [14:17:53] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host doh2002.wikimedia.org with OS bookworm completed: - doh2002 (**PASS**) - Downtimed on Icinga/Al... 
[14:17:55] (03CR) 10Jcrespo: [C: 03+1] admin: Add arnaudb to root user group [puppet] - 10https://gerrit.wikimedia.org/r/953491 (https://phabricator.wikimedia.org/T345343) (owner: 10Arnaudb) [14:18:05] (03PS4) 10Sergio Gimeno: GrowthExperiments: enable AddLink frontend 13th round of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951897 (https://phabricator.wikimedia.org/T308138) [14:18:42] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10ssingh) [14:18:48] PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:18:57] (JobUnavailable) firing: (3) Reduced availability for job nutcracker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:19:02] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to ops group for abran - https://phabricator.wikimedia.org/T345343 (10jcrespo) p:05Triage→03High a:03joanna_borun [14:19:19] (03CR) 10Arnaudb: [C: 03+2] admin: Add arnaudb to root user group [puppet] - 10https://gerrit.wikimedia.org/r/953491 (https://phabricator.wikimedia.org/T345343) (owner: 10Arnaudb) [14:19:52] (03PS1) 10Cwhite: alertmanager: emit helpful info for DatasourceError alerts [puppet] - 10https://gerrit.wikimedia.org/r/953492 (https://phabricator.wikimedia.org/T345358) [14:19:54] (03PS1) 10Cwhite: logstash: send generatorURL to labels [puppet] - 10https://gerrit.wikimedia.org/r/953493 (https://phabricator.wikimedia.org/T345358) [14:20:07] (03CR) 10Arnaudb: [C: 03+1] admin: Add arnaudb to root user group [puppet] - 10https://gerrit.wikimedia.org/r/953491 (https://phabricator.wikimedia.org/T345343) (owner: 10Arnaudb) [14:20:14] RECOVERY - Check systemd state 
on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:20:24] (03PS2) 10Cwhite: logstash: send generatorURL to labels [puppet] - 10https://gerrit.wikimedia.org/r/953493 (https://phabricator.wikimedia.org/T345358) [14:20:28] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install pc101[56] - https://phabricator.wikimedia.org/T342164 (10VRiley-WMF) pc1015 - A 6. U 33. Port 32. Cableid: 2839 [14:20:44] (03CR) 10Jcrespo: [C: 03+1] "I have verified everything, LGTM, only missing an additional +1 and Foundations or Director approval." [puppet] - 10https://gerrit.wikimedia.org/r/953491 (https://phabricator.wikimedia.org/T345343) (owner: 10Arnaudb) [14:21:12] (03CR) 10Filippo Giunchedi: [C: 03+1] gNMIc: add interface description as metrics tag [puppet] - 10https://gerrit.wikimedia.org/r/954066 (https://phabricator.wikimedia.org/T326322) (owner: 10Ayounsi) [14:22:11] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1053.eqiad.wmnet [14:22:42] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:22:50] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1054.eqiad.wmnet [14:23:05] (03CR) 10Cwhite: "I'm not sure I like gating on alertname, but this class of alerts (along with DatasourceNoData) are "special" in a sense. 
Please let me kn" [puppet] - 10https://gerrit.wikimedia.org/r/953492 (https://phabricator.wikimedia.org/T345358) (owner: 10Cwhite) [14:23:58] (03CR) 10Ayounsi: [C: 03+2] gNMIc: add interface description as metrics tag [puppet] - 10https://gerrit.wikimedia.org/r/954066 (https://phabricator.wikimedia.org/T326322) (owner: 10Ayounsi) [14:24:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316 (T343718)', diff saved to https://phabricator.wikimedia.org/P52221 and previous config saved to /var/cache/conftool/dbconfig/20230831-142424-ladsgroup.json [14:24:26] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2171.codfw.wmnet with reason: Maintenance [14:24:31] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [14:24:39] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2171.codfw.wmnet with reason: Maintenance [14:24:40] (03CR) 10Urbanecm: [C: 03+1] GrowthExperiments: enable add a link in 12th round of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/948144 (https://phabricator.wikimedia.org/T308137) (owner: 10Sergio Gimeno) [14:24:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2171:3316 (T343718)', diff saved to https://phabricator.wikimedia.org/P52222 and previous config saved to /var/cache/conftool/dbconfig/20230831-142445-ladsgroup.json [14:24:53] (03CR) 10Urbanecm: [C: 03+1] GrowthExperiments: enable AddLink frontend 13th round of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951897 (https://phabricator.wikimedia.org/T308138) (owner: 10Sergio Gimeno) [14:24:55] (03PS1) 10Hnowlan: service: move geo-analytics and media-analytics to production [puppet] - 10https://gerrit.wikimedia.org/r/954067 (https://phabricator.wikimedia.org/T336380) [14:25:01] !log cmooney@cumin1001 START - Cookbook sre.network.provision for device lsw1-a2-codfw.mgmt.codfw.wmnet [14:25:02] !log 
cmooney@cumin1001 START - Cookbook sre.dns.netbox [14:25:28] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:26:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 10%: Repooling after cloning another host', diff saved to https://phabricator.wikimedia.org/P52223 and previous config saved to /var/cache/conftool/dbconfig/20230831-142651-root.json [14:26:59] (03CR) 10Jelto: [C: 03+1] "lgtm now, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/953193 (owner: 10EoghanGaffney) [14:27:01] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2053.codfw.wmnet [14:27:43] (03PS1) 10Muehlenhoff: package_builder: Clean up lintian setup [puppet] - 10https://gerrit.wikimedia.org/r/954068 [14:27:55] (03PS1) 10Jelto: miscweb/microsites: move monitoring of research pages to monitoring profile [puppet] - 10https://gerrit.wikimedia.org/r/954069 (https://phabricator.wikimedia.org/T334511) [14:27:57] (03PS1) 10Jelto: miscweb/microsites: remove wikiworkshop and research resources [puppet] - 10https://gerrit.wikimedia.org/r/954070 (https://phabricator.wikimedia.org/T334511) [14:28:17] (03CR) 10CI reject: [V: 04-1] package_builder: Clean up lintian setup [puppet] - 10https://gerrit.wikimedia.org/r/954068 (owner: 10Muehlenhoff) [14:28:22] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:28:36] (CirrusSearchNodeIndexingNotIncreasing) resolved: Elasticsearch instance elastic2038-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - 
https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [14:28:39] (03PS2) 10Anzx: tlywiki: Add logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/954050 (https://phabricator.wikimedia.org/T345316) [14:29:00] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1054.eqiad.wmnet [14:29:19] (03PS1) 10Cathal Mooney: Remove parents for spine switches Eqiad row E/F [puppet] - 10https://gerrit.wikimedia.org/r/954071 (https://phabricator.wikimedia.org/T329272) [14:29:36] (03CR) 10David Caro: replica_cnf_api: add envvars backend (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/936232 (https://phabricator.wikimedia.org/T265691) (owner: 10David Caro) [14:29:54] (03CR) 10Anzx: tlywiki: Add logos (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/954050 (https://phabricator.wikimedia.org/T345316) (owner: 10Anzx) [14:31:08] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [14:31:21] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [14:31:28] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:31:52] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host testreduce1002.eqiad.wmnet with OS bookworm [14:31:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host testreduce1002.eqiad.wmnet [14:32:22] (03CR) 10Cathal Mooney: [C: 03+2] Remove parents for spine switches Eqiad row E/F [puppet] - 10https://gerrit.wikimedia.org/r/954071 (https://phabricator.wikimedia.org/T329272) (owner: 10Cathal Mooney) [14:32:44] (03CR) 10Ayounsi: [C: 03+1] Remove parents for spine switches Eqiad row E/F [puppet] - 
10https://gerrit.wikimedia.org/r/954071 (https://phabricator.wikimedia.org/T329272) (owner: 10Cathal Mooney) [14:34:26] (03CR) 10Jon Harald Søby: tlywiki: add metanamespace , timezone, sitename (034 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953652 (https://phabricator.wikimedia.org/T345316) (owner: 10Anzx) [14:34:56] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4005.ulsfo.wmnet [14:35:17] (03CR) 10Majavah: [C: 03+2] team-wmcs: Add CloudLB backend status checks [alerts] - 10https://gerrit.wikimedia.org/r/954052 (https://phabricator.wikimedia.org/T345294) (owner: 10Majavah) [14:36:31] (03Merged) 10jenkins-bot: team-wmcs: Add CloudLB backend status checks [alerts] - 10https://gerrit.wikimedia.org/r/954052 (https://phabricator.wikimedia.org/T345294) (owner: 10Majavah) [14:38:08] (03CR) 10Jon Harald Søby: [C: 03+1] "LGTM, thanks!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/954050 (https://phabricator.wikimedia.org/T345316) (owner: 10Anzx) [14:39:04] (03CR) 10JMeybohm: [C: 03+1] service: move geo-analytics and media-analytics to production [puppet] - 10https://gerrit.wikimedia.org/r/954067 (https://phabricator.wikimedia.org/T336380) (owner: 10Hnowlan) [14:39:23] (03CR) 10Jelto: [C: 03+2] miscweb/microsites: move monitoring of research pages to monitoring profile [puppet] - 10https://gerrit.wikimedia.org/r/954069 (https://phabricator.wikimedia.org/T334511) (owner: 10Jelto) [14:40:42] (03CR) 10Anzx: tlywiki: add metanamespace , timezone, sitename (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953652 (https://phabricator.wikimedia.org/T345316) (owner: 10Anzx) [14:41:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 25%: Repooling after cloning another host', diff saved to https://phabricator.wikimedia.org/P52224 and previous config saved to /var/cache/conftool/dbconfig/20230831-144155-root.json [14:42:19] (03CR) 10Jcrespo: [C: 03+1] 
"https://puppet-compiler.wmflabs.org/output/953491/43094/" [puppet] - 10https://gerrit.wikimedia.org/r/953491 (https://phabricator.wikimedia.org/T345343) (owner: 10Arnaudb) [14:42:32] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10VRiley-WMF) kubernetes1031 - A 6. U 37. port 30 Cableid 4017 kubernetes1030 - A 6. U 36. port 31 Cableid 1917 kubernetes1029 - A 6. U 35. port24 Cableid: 1947 [14:43:44] (03PS2) 10Hashar: gitlab: add default_project_features parameter to profile [puppet] - 10https://gerrit.wikimedia.org/r/954063 (https://phabricator.wikimedia.org/T264231) [14:43:46] (03PS2) 10Hashar: gitlab: disable issue tracker by default on devtools [puppet] - 10https://gerrit.wikimedia.org/r/954064 (https://phabricator.wikimedia.org/T264231) [14:43:48] (03PS2) 10Hashar: gitlab: disable issue tracker by default on production [puppet] - 10https://gerrit.wikimedia.org/r/954065 (https://phabricator.wikimedia.org/T264231) [14:43:51] (03PS1) 10Hashar: gitlab: project_features > default_projects_features [puppet] - 10https://gerrit.wikimedia.org/r/954072 (https://phabricator.wikimedia.org/T264231) [14:44:21] (03Abandoned) 10Hashar: gitlab: disable issue tracker by default on production [puppet] - 10https://gerrit.wikimedia.org/r/954065 (https://phabricator.wikimedia.org/T264231) (owner: 10Hashar) [14:44:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316 (T343718)', diff saved to https://phabricator.wikimedia.org/P52225 and previous config saved to /var/cache/conftool/dbconfig/20230831-144425-ladsgroup.json [14:44:31] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [14:44:31] (03CR) 10CI reject: [V: 04-1] gitlab: project_features > default_projects_features [puppet] - 10https://gerrit.wikimedia.org/r/954072 (https://phabricator.wikimedia.org/T264231) (owner: 10Hashar) [14:44:59] (03PS40) 10David Caro: 
replica_cnf_api: add envvars backend [puppet] - 10https://gerrit.wikimedia.org/r/936232 (https://phabricator.wikimedia.org/T265691) [14:45:01] (03PS1) 10David Caro: sonofagridengine: pin openstacksdk to <1.5.0 [puppet] - 10https://gerrit.wikimedia.org/r/954073 [14:45:36] (03Abandoned) 10Hashar: gitlab: disable issue tracker by default on devtools [puppet] - 10https://gerrit.wikimedia.org/r/954064 (https://phabricator.wikimedia.org/T264231) (owner: 10Hashar) [14:46:06] (03CR) 10CI reject: [V: 04-1] gitlab: add default_project_features parameter to profile [puppet] - 10https://gerrit.wikimedia.org/r/954063 (https://phabricator.wikimedia.org/T264231) (owner: 10Hashar) [14:46:26] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] sonofagridengine: pin openstacksdk to <1.5.0 [puppet] - 10https://gerrit.wikimedia.org/r/954073 (owner: 10David Caro) [14:46:28] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudservices1006.eqiad.wmnet with OS bullseye [14:46:34] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti4005.ulsfo.wmnet [14:47:12] (03CR) 10David Caro: team-wmcs: Add Galera checks (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/953727 (https://phabricator.wikimedia.org/T345294) (owner: 10Majavah) [14:47:22] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:47:25] 10SRE, 10Bitu, 10Infrastructure-Foundations, 10wikitech.wikimedia.org, 10Patch-For-Review: Switch developer account creation to Bitu - https://phabricator.wikimedia.org/T345226 (10JJMC89) [14:47:28] (03PS2) 10Hashar: gitlab: project_features > default_projects_features [puppet] - 10https://gerrit.wikimedia.org/r/954072 (https://phabricator.wikimedia.org/T264231) [14:47:30] (03PS3) 10Hashar: gitlab: add default_project_features parameter to profile [puppet] - 
10https://gerrit.wikimedia.org/r/954063 (https://phabricator.wikimedia.org/T264231) [14:48:11] (03CR) 10David Caro: [C: 03+2] replica_cnf_api: add envvars backend [puppet] - 10https://gerrit.wikimedia.org/r/936232 (https://phabricator.wikimedia.org/T265691) (owner: 10David Caro) [14:48:32] (03CR) 10David Caro: [C: 03+2] sonofagridengine: pin openstacksdk to <1.5.0 [puppet] - 10https://gerrit.wikimedia.org/r/954073 (owner: 10David Caro) [14:48:48] (03CR) 10David Caro: [C: 03+2] replica_cnf_api: add envvars backend (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/936232 (https://phabricator.wikimedia.org/T265691) (owner: 10David Caro) [14:49:19] (03CR) 10Hashar: "Our parameter uses the singular form `project_features` whereas upstream it is `default_projects_features` (with plural form for project). Ali" [puppet] - 10https://gerrit.wikimedia.org/r/954072 (https://phabricator.wikimedia.org/T264231) (owner: 10Hashar) [14:50:16] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:51:36] (03PS1) 10Majavah: team-wmcs: Move response time alert to correct prometheus instance [alerts] - 10https://gerrit.wikimedia.org/r/954074 [14:52:54] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4005.ulsfo.wmnet [14:52:58] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4005.ulsfo.wmnet [14:53:11] 10SRE, 10Bitu, 10Infrastructure-Foundations, 10wikitech.wikimedia.org, 10Patch-For-Review: Switch developer account creation to Bitu - https://phabricator.wikimedia.org/T345226 (10MoritzMuehlenhoff) [14:54:15] (03CR) 10Filippo Giunchedi: [C: 03+1] logstash: send generatorURL to labels [puppet] - 10https://gerrit.wikimedia.org/r/953493 (https://phabricator.wikimedia.org/T345358) (owner: 10Cwhite) [14:54:56] (03CR) 10Cwhite: [C: 03+2] logstash: send generatorURL to 
labels [puppet] - 10https://gerrit.wikimedia.org/r/953493 (https://phabricator.wikimedia.org/T345358) (owner: 10Cwhite) [14:55:11] (03CR) 10Filippo Giunchedi: [C: 03+1] alertmanager: emit helpful info for DatasourceError alerts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/953492 (https://phabricator.wikimedia.org/T345358) (owner: 10Cwhite) [14:55:16] (03PS2) 10Muehlenhoff: package_builder: Clean up lintian setup [puppet] - 10https://gerrit.wikimedia.org/r/954068 [14:56:22] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4006.ulsfo.wmnet [14:56:59] dcaro: are your puppet changes ready for deploy? [14:57:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 50%: Repooling after cloning another host', diff saved to https://phabricator.wikimedia.org/P52226 and previous config saved to /var/cache/conftool/dbconfig/20230831-145700-root.json [14:57:41] cwhite: yes thanks! [14:57:59] done [14:58:04] thanks :) [14:59:28] 10SRE, 10Bitu, 10Infrastructure-Foundations, 10wikitech.wikimedia.org, 10Patch-For-Review: Switch developer account creation to Bitu - https://phabricator.wikimedia.org/T345226 (10MoritzMuehlenhoff) I've rolled out CAS 6.6.11 with an additional patch which points to Bitu for password resets and signups. 
[14:59:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316', diff saved to https://phabricator.wikimedia.org/P52227 and previous config saved to /var/cache/conftool/dbconfig/20230831-145931-ladsgroup.json [14:59:56] (03Abandoned) 10Muehlenhoff: Point IDP login page to IDM for signup [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/927661 (https://phabricator.wikimedia.org/T338008) (owner: 10Muehlenhoff) [14:59:58] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2054.codfw.wmnet [15:00:07] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1055.eqiad.wmnet [15:02:11] (03PS1) 10Majavah: wikitech: Disable password resets [mediawiki-config] - 10https://gerrit.wikimedia.org/r/954076 (https://phabricator.wikimedia.org/T345226) [15:02:34] (03CR) 10Majavah: "Is IDM ready for this yet?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/954076 (https://phabricator.wikimedia.org/T345226) (owner: 10Majavah) [15:05:10] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti4006.ulsfo.wmnet [15:05:12] (03CR) 10Andrew Bogott: [C: 03+1] "Agree with David about having more specific runbooks, otherwise lgtm" [alerts] - 10https://gerrit.wikimedia.org/r/953727 (https://phabricator.wikimedia.org/T345294) (owner: 10Majavah) [15:06:03] (03PS1) 10DDesouza: Pre-deploy Campaigns Event Discovery survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/954079 (https://phabricator.wikimedia.org/T345158) [15:06:31] RECOVERY - Check correctness of the icinga configuration on alert1001 is OK: Icinga configuration is correct https://wikitech.wikimedia.org/wiki/Icinga [15:06:43] (03PS2) 10DDesouza: Undeploy Research Incentive survey on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/950046 (https://phabricator.wikimedia.org/T336092) [15:10:36] (03PS2) 10DDesouza: Pre-deploy Campaigns Event Discovery survey [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/954079 (https://phabricator.wikimedia.org/T345158) [15:11:26] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1055.eqiad.wmnet [15:11:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4006.ulsfo.wmnet [15:11:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4006.ulsfo.wmnet [15:12:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 75%: Repooling after cloning another host', diff saved to https://phabricator.wikimedia.org/P52228 and previous config saved to /var/cache/conftool/dbconfig/20230831-151205-root.json [15:12:37] !log bking@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - bking@cumin1001" [15:12:38] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs1010.eqiad.wmnet with OS bullseye [15:12:57] (03CR) 10Marostegui: [C: 03+1] admin: Add arnaudb to root user group [puppet] - 10https://gerrit.wikimedia.org/r/953491 (https://phabricator.wikimedia.org/T345343) (owner: 10Arnaudb) [15:13:26] (03CR) 10David Caro: team-wmcs: Add Galera checks (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/953727 (https://phabricator.wikimedia.org/T345294) (owner: 10Majavah) [15:13:28] (03CR) 10Filippo Giunchedi: [C: 03+2] mesh: new configuration version [deployment-charts] - 10https://gerrit.wikimedia.org/r/953575 (https://phabricator.wikimedia.org/T320563) (owner: 10Filippo Giunchedi) [15:14:23] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:14:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316', diff saved to 
https://phabricator.wikimedia.org/P52229 and previous config saved to /var/cache/conftool/dbconfig/20230831-151437-ladsgroup.json [15:14:47] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1056.eqiad.wmnet [15:15:31] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [15:16:28] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] team-wmcs: Move response time alert to correct prometheus instance [alerts] - 10https://gerrit.wikimedia.org/r/954074 (owner: 10Majavah) [15:16:41] (03CR) 10Majavah: [C: 03+2] team-wmcs: Move response time alert to correct prometheus instance [alerts] - 10https://gerrit.wikimedia.org/r/954074 (owner: 10Majavah) [15:17:43] (03PS5) 10Anzx: tlywiki: add metanamespace , timezone, sitename [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953652 (https://phabricator.wikimedia.org/T345316) [15:17:53] (03Merged) 10jenkins-bot: team-wmcs: Move response time alert to correct prometheus instance [alerts] - 10https://gerrit.wikimedia.org/r/954074 (owner: 10Majavah) [15:21:03] (03CR) 10Jon Harald Søby: [C: 03+1] tlywiki: add metanamespace , timezone, sitename [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953652 (https://phabricator.wikimedia.org/T345316) (owner: 10Anzx) [15:21:15] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:21:26] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1056.eqiad.wmnet [15:22:17] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2054.codfw.wmnet [15:22:55] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4007.ulsfo.wmnet [15:24:50] !log extend backup1009 lv by additional 10TiB [15:24:52] (03PS1) 10Volans: puppetdb: drop support for deprecated API v3 [software/cumin] - 10https://gerrit.wikimedia.org/r/954081 [15:24:53] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 100%: Repooling after cloning another host', diff saved to https://phabricator.wikimedia.org/P52230 and previous config saved to /var/cache/conftool/dbconfig/20230831-152710-root.json [15:27:25] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [15:28:17] 10SRE-tools, 10Spicerack: Cookbook should ask for confirmation at beginning of execution - https://phabricator.wikimedia.org/T345370 (10Fabfur) [15:28:40] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti4007.ulsfo.wmnet [15:29:06] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2055.codfw.wmnet [15:29:10] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1057.eqiad.wmnet [15:29:16] (03CR) 10Ladsgroup: [C: 03+1] "Thanks! Do you want me to merge it?" 
[puppet] - 10https://gerrit.wikimedia.org/r/953491 (https://phabricator.wikimedia.org/T345343) (owner: 10Arnaudb) [15:29:29] (03CR) 10Ladsgroup: [C: 03+2] mysql: Stop removing the downtime after clone is done [cookbooks] - 10https://gerrit.wikimedia.org/r/954059 (owner: 10Ladsgroup) [15:29:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316 (T343718)', diff saved to https://phabricator.wikimedia.org/P52231 and previous config saved to /var/cache/conftool/dbconfig/20230831-152943-ladsgroup.json [15:29:46] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2180.codfw.wmnet with reason: Maintenance [15:29:49] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [15:29:59] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2180.codfw.wmnet with reason: Maintenance [15:30:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2180 (T343718)', diff saved to https://phabricator.wikimedia.org/P52232 and previous config saved to /var/cache/conftool/dbconfig/20230831-153005-ladsgroup.json [15:30:20] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db12[26-33] - https://phabricator.wikimedia.org/T342176 (10VRiley-WMF) db1227 - A 7. U 24. [15:31:01] PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
An error occured trying to list the failed units https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:31:49] (03CR) 10CI reject: [V: 04-1] puppetdb: drop support for deprecated API v3 [software/cumin] - 10https://gerrit.wikimedia.org/r/954081 (owner: 10Volans) [15:32:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T343718)', diff saved to https://phabricator.wikimedia.org/P52233 and previous config saved to /var/cache/conftool/dbconfig/20230831-153217-ladsgroup.json [15:32:21] RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:32:29] (03Merged) 10jenkins-bot: mysql: Stop removing the downtime after clone is done [cookbooks] - 10https://gerrit.wikimedia.org/r/954059 (owner: 10Ladsgroup) [15:35:25] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4007.ulsfo.wmnet [15:35:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4007.ulsfo.wmnet [15:35:56] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2055.codfw.wmnet [15:36:26] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1057.eqiad.wmnet [15:37:22] 10SRE, 10serviceops, 10CommRel-Specialists-Support (Jul-Sep-2023), 10Datacenter-Switchover: CommRel support for September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345265 (10Trizek-WMF) a:03Trizek-WMF @kamila, thank you for asking for our support. We have a message ready for commu... 
[15:37:40] 10SRE, 10serviceops, 10CommRel-Specialists-Support (Jul-Sep-2023), 10Datacenter-Switchover: CommRel support for September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345265 (10Trizek-WMF) p:05Triage→03High [15:39:01] !log failover ganeti master in ulsfo to ganeti4005 [15:39:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:12] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db12[34-49] - https://phabricator.wikimedia.org/T342166 (10VRiley-WMF) db1235 - A 3. U 40. port 34 Cableid 1903 [15:39:21] PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:39:21] (03CR) 10Hnowlan: [C: 03+2] service: move geo-analytics and media-analytics to production [puppet] - 10https://gerrit.wikimedia.org/r/954067 (https://phabricator.wikimedia.org/T336380) (owner: 10Hnowlan) [15:39:52] (03CR) 10Jcrespo: [C: 03+1] "The patch itself is ready, but we are waiting on ticket for Jobo's ok (to follow procedure)." 
[puppet] - 10https://gerrit.wikimedia.org/r/953491 (https://phabricator.wikimedia.org/T345343) (owner: 10Arnaudb) [15:40:19] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2056.codfw.wmnet [15:40:32] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1058.eqiad.wmnet [15:42:09] (03CR) 10JHathaway: [C: 03+1] "looks good" [puppet] - 10https://gerrit.wikimedia.org/r/954068 (owner: 10Muehlenhoff) [15:42:11] RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:42:57] PROBLEM - ganeti-wconfd running on ganeti4008 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 115 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [15:44:40] !log hnowlan@cumin1001 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on P{lvs1020*,lvs2014*} and A:lvs (T336380) [15:44:46] T336380: AQS 2.0: Media Analytics Service - Deploy to staging and production - https://phabricator.wikimedia.org/T336380 [15:45:04] (03PS2) 10Volans: puppetdb: drop support for deprecated API v3 [software/cumin] - 10https://gerrit.wikimedia.org/r/954081 [15:45:06] (03PS1) 10Volans: tox.ini: add compatibility with newer Sphinx [software/cumin] - 10https://gerrit.wikimedia.org/r/954087 [15:45:08] (03PS1) 10Volans: puppetdb: ignore bandit false positive B113 [software/cumin] - 10https://gerrit.wikimedia.org/r/954088 [15:45:40] (03CR) 10Subramanya Sastry: [C: 03+2] "Adding Editing folks for visibility into this cherry-pick." 
[extensions/VisualEditor] (wmf/1.41.0-wmf.23) - 10https://gerrit.wikimedia.org/r/954048 (https://phabricator.wikimedia.org/T339365) (owner: 10Arlolra) [15:45:57] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on P{lvs1020*,lvs2014*} and A:lvs (T336380) [15:46:10] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2056.codfw.wmnet [15:46:48] (03CR) 10Subramanya Sastry: [C: 03+2] Use metrics from SiteConfig to restore the Parsoid prefix [extensions/VisualEditor] (wmf/1.41.0-wmf.24) - 10https://gerrit.wikimedia.org/r/954049 (https://phabricator.wikimedia.org/T339365) (owner: 10Arlolra) [15:47:17] 10SRE-tools, 10Infrastructure-Foundations: Cookbooks could be more verbose in listing the completed/missing steps - https://phabricator.wikimedia.org/T345375 (10Fabfur) [15:47:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P52234 and previous config saved to /var/cache/conftool/dbconfig/20230831-154724-ladsgroup.json [15:48:12] !log hnowlan@cumin1001 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on P{lvs1019*,lvs2013*} and A:lvs (T336380) [15:49:06] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2057.codfw.wmnet [15:49:07] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on P{lvs1019*,lvs2013*} and A:lvs (T336380) [15:49:26] (03CR) 10Subramanya Sastry: [C: 04-2] "oops I shouldn't be +2ing backports." 
[extensions/VisualEditor] (wmf/1.41.0-wmf.24) - 10https://gerrit.wikimedia.org/r/954049 (https://phabricator.wikimedia.org/T339365) (owner: 10Arlolra) [15:49:34] (03CR) 10Subramanya Sastry: [C: 04-2] Use metrics from SiteConfig to restore the Parsoid prefix [extensions/VisualEditor] (wmf/1.41.0-wmf.23) - 10https://gerrit.wikimedia.org/r/954048 (https://phabricator.wikimedia.org/T339365) (owner: 10Arlolra) [15:49:50] (03CR) 10Subramanya Sastry: [C: 04-2] "brain fart and loss of focus .. I shouldn't have been +2ing these." [extensions/VisualEditor] (wmf/1.41.0-wmf.23) - 10https://gerrit.wikimedia.org/r/954048 (https://phabricator.wikimedia.org/T339365) (owner: 10Arlolra) [15:51:17] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:51:58] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: Cookbook should ask for confirmation at beginning of execution - https://phabricator.wikimedia.org/T345370 (10Volans) For context some cookbooks that deems what they are doing dangerous already do that, for example the aforementioned `sre.hosts.reimage`... 
[15:52:22] (03CR) 10CI reject: [V: 04-1] puppetdb: ignore bandit false positive B113 [software/cumin] - 10https://gerrit.wikimedia.org/r/954088 (owner: 10Volans) [15:52:53] (03CR) 10CI reject: [V: 04-1] puppetdb: drop support for deprecated API v3 [software/cumin] - 10https://gerrit.wikimedia.org/r/954081 (owner: 10Volans) [15:53:02] (03CR) 10CI reject: [V: 04-1] tox.ini: add compatibility with newer Sphinx [software/cumin] - 10https://gerrit.wikimedia.org/r/954087 (owner: 10Volans) [15:53:03] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:54:07] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:54:34] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:54:57] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1058.eqiad.wmnet [15:55:17] (03CR) 10Volans: "CI failures are due to https://github.com/pyparsing/pyparsing/issues/501 for which I've sent https://github.com/pyparsing/pyparsing/pull/5" [software/cumin] - 10https://gerrit.wikimedia.org/r/954087 (owner: 10Volans) [15:55:52] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime for 5 days, 0:00:00 on cloudservices1006.eqiad.wmnet with reason: service bootstrap [15:56:06] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on cloudservices1006.eqiad.wmnet with reason: service bootstrap [15:57:09] (03PS3) 10Volans: puppetdb: drop support for deprecated API v3 [software/cumin] - 
10https://gerrit.wikimedia.org/r/954081 [15:57:25] (03Abandoned) 10Volans: puppetdb: ignore bandit false positive B113 [software/cumin] - 10https://gerrit.wikimedia.org/r/954088 (owner: 10Volans) [15:57:51] (03PS2) 10Majavah: team-wmcs: Add Galera checks [alerts] - 10https://gerrit.wikimedia.org/r/953727 (https://phabricator.wikimedia.org/T345294) [15:58:18] (03CR) 10Majavah: team-wmcs: Add Galera checks (032 comments) [alerts] - 10https://gerrit.wikimedia.org/r/953727 (https://phabricator.wikimedia.org/T345294) (owner: 10Majavah) [15:58:20] (03CR) 10Volans: "CI failures are due to https://github.com/pyparsing/pyparsing/issues/501 for which I've sent https://github.com/pyparsing/pyparsing/pull/5" [software/cumin] - 10https://gerrit.wikimedia.org/r/954081 (owner: 10Volans) [15:59:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:00:05] jbond and rzl: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230831T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. [16:00:36] (03CR) 10Herron: "opening up for feedback to get the ball rolling. 
as-is it is broad in terms of affected hosts, so in addition to feedback on the patch it" [puppet] - 10https://gerrit.wikimedia.org/r/952894 (https://phabricator.wikimedia.org/T345377) (owner: 10Herron) [16:00:54] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:01:47] 10SRE-tools, 10Infrastructure-Foundations: Cookbooks could be more verbose in listing the completed/missing steps - https://phabricator.wikimedia.org/T345375 (10Volans) Improving the cookbook outputs and readability of it is surely always a great idea. I'm not sure though what are you proposing as actionable.... [16:02:03] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2057.codfw.wmnet [16:02:09] (03CR) 10Jbond: [C: 03+2] puppet: drop deprecated ignorecache switch [software/spicerack] - 10https://gerrit.wikimedia.org/r/953990 (https://phabricator.wikimedia.org/T341496) (owner: 10Jbond) [16:02:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P52235 and previous config saved to /var/cache/conftool/dbconfig/20230831-160230-ladsgroup.json [16:03:27] (03CR) 10CI reject: [V: 04-1] puppetdb: drop support for deprecated API v3 [software/cumin] - 10https://gerrit.wikimedia.org/r/954081 (owner: 10Volans) [16:04:30] (03CR) 10Jbond: [C: 03+1] "lgtm" [software/cumin] - 10https://gerrit.wikimedia.org/r/954087 (owner: 10Volans) [16:04:56] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-cluster (exit_code=0) [16:05:40] (03CR) 10David Caro: team-wmcs: Add Galera checks (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/953727 (https://phabricator.wikimedia.org/T345294) (owner: 10Majavah) [16:06:09] (03Merged) 10jenkins-bot: puppet: drop deprecated ignorecache switch [software/spicerack] - 10https://gerrit.wikimedia.org/r/953990 
(https://phabricator.wikimedia.org/T341496) (owner: 10Jbond) [16:06:52] (03CR) 10Volans: [V: 03+2 C: 03+2] tox.ini: add compatibility with newer Sphinx [software/cumin] - 10https://gerrit.wikimedia.org/r/954087 (owner: 10Volans) [16:09:30] (03CR) 10Jbond: [C: 03+1] "LGTM" [software/cumin] - 10https://gerrit.wikimedia.org/r/954081 (owner: 10Volans) [16:09:52] (03PS1) 10Bking: wdqs: re-enable alerts on wdqs1010 [puppet] - 10https://gerrit.wikimedia.org/r/954093 (https://phabricator.wikimedia.org/T344518) [16:13:54] (03PS3) 10Majavah: team-wmcs: Add Galera checks [alerts] - 10https://gerrit.wikimedia.org/r/953727 (https://phabricator.wikimedia.org/T345294) [16:14:12] (03CR) 10Majavah: team-wmcs: Add Galera checks (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/953727 (https://phabricator.wikimedia.org/T345294) (owner: 10Majavah) [16:17:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T343718)', diff saved to https://phabricator.wikimedia.org/P52236 and previous config saved to /var/cache/conftool/dbconfig/20230831-161736-ladsgroup.json [16:17:44] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [16:19:34] PROBLEM - Host cp5018 is DOWN: CRITICAL - Time to live exceeded (10.132.0.18) [16:19:34] PROBLEM - Host cp5019 is DOWN: CRITICAL - Time to live exceeded (10.132.0.19) [16:19:34] PROBLEM - Host cp5023 is DOWN: CRITICAL - Time to live exceeded (10.132.0.34) [16:19:34] PROBLEM - Host cp5028 is DOWN: CRITICAL - Time to live exceeded (10.132.0.25) [16:19:34] PROBLEM - Host cp5025 is DOWN: CRITICAL - Time to live exceeded (10.132.0.36) [16:19:34] PROBLEM - Host cp5030 is DOWN: CRITICAL - Time to live exceeded (10.132.0.27) [16:19:35] PROBLEM - Host asw2-ulsfo is DOWN: CRITICAL - Time to live exceeded (10.128.128.7) [16:19:39] PROBLEM - Host pfw3-codfw #page is DOWN: CRITICAL - Time to live exceeded (208.80.153.197) [16:19:50] RECOVERY - Host cp5019 is UP: PING OK - Packet loss = 0%, RTA 
= 249.02 ms [16:19:50] RECOVERY - Host cp5023 is UP: PING OK - Packet loss = 0%, RTA = 235.25 ms [16:19:50] RECOVERY - Host cp5025 is UP: PING OK - Packet loss = 0%, RTA = 243.01 ms [16:19:50] RECOVERY - Host cp5028 is UP: PING OK - Packet loss = 0%, RTA = 303.26 ms [16:19:50] RECOVERY - Host cp5030 is UP: PING OK - Packet loss = 0%, RTA = 242.93 ms [16:19:50] RECOVERY - Host cp5018 is UP: PING OK - Packet loss = 0%, RTA = 330.15 ms [16:19:52] RECOVERY - Host asw2-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 71.54 ms [16:19:55] RECOVERY - Host pfw3-codfw #page is UP: PING OK - Packet loss = 0%, RTA = 30.18 ms [16:20:13] hello [16:21:16] hmm recovered so quickly that it didn't page on victorops but that's fine [16:21:29] something did happen here [16:25:27] XioNoX: topranks: ^ sorry for the late ping but this might be worth a look [16:26:45] are we aware of any scheduled maintenance? [16:27:53] nothing's on the calendar AFAICT [16:28:07] yeah, nothing to noc@ as well as I can see [16:29:09] the closest thing I see is the revert https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/ad0775e516cae00163e4eb0bdf0da1077162d425%5E%21/#F0 but I am not sure how this can be related given that it was already reverted [16:29:28] PROBLEM - BFD status on cr1-codfw is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:29:50] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase1030.eqiad.wmnet [16:30:08] 208.80.153.220 Down xe-1/1/1:3.0 6.000 2.000 3 [16:31:02] 10SRE, 10serviceops, 10CommRel-Specialists-Support (Jul-Sep-2023), 10Datacenter-Switchover: CommRel support for September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345265 (10RZamora-WMF) Thanks for claiming this Phab task 👍 [16:34:21] (03PS2) 10Jelto: miscweb/microsites: remove wikiworkshop and research resources [puppet] - 10https://gerrit.wikimedia.org/r/954070 (https://phabricator.wikimedia.org/T334511) [16:36:01] 
(03CR) 10Jelto: [C: 03+2] miscweb/microsites: remove wikiworkshop and research resources [puppet] - 10https://gerrit.wikimedia.org/r/954070 (https://phabricator.wikimedia.org/T334511) (owner: 10Jelto) [16:36:44] sukhe: hey just looking, not aware of anything no [16:36:58] TTL exceeded suggests some routing issue though hmmm [16:37:13] (03PS1) 10Majavah: openstack: Remove a bunch of Icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/954102 (https://phabricator.wikimedia.org/T345294) [16:37:17] yeah... which is I guess what makes me worried, even though it was a flap [16:37:35] I'm only catching up with your later messages, was a transport link flapping? [16:39:28] RECOVERY - BFD status on cr1-codfw is OK: UP: 22 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:39:30] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43097/console" [puppet] - 10https://gerrit.wikimedia.org/r/954102 (https://phabricator.wikimedia.org/T345294) (owner: 10Majavah) [16:39:39] ^^^ this was after manual clearing of bfd session [16:40:08] topranks: yeah that was .220 above or xe-4-2-0.cr1-eqiad.wikimedia.org but I am not sure if that's related (that's eqiad -> codfw though?) 
[16:40:16] the cp5* are eqsin [16:40:30] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for lsw1-a2-codfw - cmooney@cumin1001" [16:41:18] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for lsw1-a2-codfw - cmooney@cumin1001" [16:41:18] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:41:33] sukhe: it could possibly be if traffic was getting sent eqiad->codfw and back again due to link flapping [16:41:44] but I've no reason to suspect that for sure [16:42:03] !log eevans@cumin1001 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host restbase1030.eqiad.wmnet [16:42:33] that link was flapping up/down like mad since 16:11 alright [16:42:51] ah [16:43:39] 10SRE, 10serviceops, 10CommRel-Specialists-Support (Jul-Sep-2023), 10Datacenter-Switchover: CommRel support for September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345265 (10kamila) Thank you @Trizek-WMF ! The message looks good. Maybe I'd suggest replacing the word "first" with "prim... [16:45:27] sukhe: I'm gonna assume it was that alright. Packets seeing best path via that link, then not, then seeing it again [16:45:30] 10SRE, 10Infrastructure-Foundations, 10netbox, 10netops: Netbox Juniper report - https://phabricator.wikimedia.org/T306238 (10ayounsi) Thanks, I submitted the on-boarding form, let's see what happens now. 
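Annotation: the hypothesis discussed above (traffic bouncing eqiad->codfw and back while a link flaps, surfacing as "Time to live exceeded") can be illustrated with a toy model. This is only a sketch under stated assumptions: the router names `r1`, `r2`, `r3`, the `forward` helper, and the next-hop tables are all hypothetical and do not describe the real topology or any Wikimedia tooling. It shows that when two routers briefly point at each other, every hop still decrements the TTL, so the packet is dropped instead of circulating forever.

```python
def forward(next_hop, src, dst, ttl=64):
    """Follow next-hop entries from src toward dst, decrementing TTL per hop.

    next_hop maps each (hypothetical) router name to the router it currently
    considers the best path. Returns ("delivered", hops) on arrival, or
    ("ttl_exceeded", router) naming the router that dropped the packet.
    """
    node, hops = src, 0
    while node != dst:
        ttl -= 1                      # each forwarding router decrements TTL
        if ttl == 0:
            return ("ttl_exceeded", node)  # this router would emit ICMP Time Exceeded
        node = next_hop[node]
        hops += 1
    return ("delivered", hops)


# Stable state: r1 -> r2 -> r3.
stable = {"r1": "r2", "r2": "r3"}

# Mid-flap state (assumed): r2 has lost its link to r3 and points back at r1,
# while r1 has not reconverged yet and still points at r2.
looped = {"r1": "r2", "r2": "r1"}

print(forward(stable, "r1", "r3"))
print(forward(looped, "r1", "r3"))
```

In the looped case the packet never reaches `r3`; it is discarded once the TTL runs out, which matches the monitoring output seeing "Time to live exceeded" rather than plain packet loss.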
[16:45:39] we have the same for asw1-ulsfo there as well, so not just eqsin affected [16:45:57] TTL exceeded essentially means packet was in a routing loop [16:46:14] and likely reason for that is link flapping [16:47:43] (03PS1) 10Majavah: icinga: Don't tie wikitech-static alerts to cloudweb hosts [puppet] - 10https://gerrit.wikimedia.org/r/954104 [16:47:48] topranks: thanks for checking and confirming [16:48:01] np, I'll keep an eye on it, seems stable right now anyway [16:48:02] since it was a flap, do you still think it merits a task? I can file one [16:48:06] (03CR) 10CI reject: [V: 04-1] icinga: Don't tie wikitech-static alerts to cloudweb hosts [puppet] - 10https://gerrit.wikimedia.org/r/954104 (owner: 10Majavah) [16:48:17] yeah I'm just doing one here [16:48:21] <3 [16:48:45] (03CR) 10David Caro: [C: 03+1] team-wmcs: Add Galera checks (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/953727 (https://phabricator.wikimedia.org/T345294) (owner: 10Majavah) [16:49:04] (03CR) 10Ori: [C: 03+1] "I can merge this if you like." 
[puppet] - 10https://gerrit.wikimedia.org/r/952488 (https://phabricator.wikimedia.org/T321099) (owner: 10Jforrester) [16:49:10] (03PS2) 10Majavah: icinga: Don't tie wikitech-static alerts to cloudweb hosts [puppet] - 10https://gerrit.wikimedia.org/r/954104 [16:49:33] (03CR) 10CI reject: [V: 04-1] icinga: Don't tie wikitech-static alerts to cloudweb hosts [puppet] - 10https://gerrit.wikimedia.org/r/954104 (owner: 10Majavah) [16:49:53] (03CR) 10Cwhite: [C: 03+2] alertmanager: emit helpful info for DatasourceError alerts [puppet] - 10https://gerrit.wikimedia.org/r/953492 (https://phabricator.wikimedia.org/T345358) (owner: 10Cwhite) [16:49:54] PROBLEM - Host restbase1030 is DOWN: PING CRITICAL - Packet loss = 100% [16:51:20] 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T345380 (10phaultfinder) [16:51:55] uh oh, one more [16:52:45] (03CR) 10Majavah: [C: 03+2] team-wmcs: Add Galera checks (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/953727 (https://phabricator.wikimedia.org/T345294) (owner: 10Majavah) [16:53:29] (03PS3) 10Majavah: icinga: Don't tie wikitech-static alerts to cloudweb hosts [puppet] - 10https://gerrit.wikimedia.org/r/954104 [16:53:57] (03Merged) 10jenkins-bot: team-wmcs: Add Galera checks [alerts] - 10https://gerrit.wikimedia.org/r/953727 (https://phabricator.wikimedia.org/T345294) (owner: 10Majavah) [16:53:58] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [16:55:21] urandom: SAL suggests you were working on restbase1030 [16:55:46] it's not TTL exceeded like the previous batch anyway (plus in eqiad) [16:55:56] sukhe: yes [16:56:04] topranks: yeah, this one is definitely unrelated! [16:56:14] sukhe: why, is it alerting? I thought I downtimed it. 
[16:56:31] urandom: no idea, just thought I should let you know in case some action is required :) [16:56:32] host down alert above yeah [16:56:34] no big deal [16:56:45] urandom: want me to downtime it again? [16:57:03] (03PS4) 10Majavah: icinga: Don't tie wikitech-static alerts to cloudweb hosts [puppet] - 10https://gerrit.wikimedia.org/r/954104 [16:57:12] sukhe: I just did :/ [16:58:23] even the bots are giving up today [16:58:40] (03PS1) 10BryanDavis: developer-portal: Bump container to 2023-08-28-113303-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/954110 [16:59:10] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 2 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43100/console" [puppet] - 10https://gerrit.wikimedia.org/r/954104 (owner: 10Majavah) [16:59:52] (03CR) 10BryanDavis: [C: 03+2] developer-portal: Bump container to 2023-08-28-113303-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/954110 (owner: 10BryanDavis) [17:00:04] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230831T1700) [17:00:38] (03Merged) 10jenkins-bot: developer-portal: Bump container to 2023-08-28-113303-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/954110 (owner: 10BryanDavis) [17:01:29] !log bd808@deploy1002 helmfile [staging] START helmfile.d/services/developer-portal: apply [17:01:58] !log bd808@deploy1002 helmfile [staging] DONE helmfile.d/services/developer-portal: apply [17:02:04] !log bd808@deploy1002 helmfile [codfw] START helmfile.d/services/developer-portal: apply [17:02:39] !log bd808@deploy1002 helmfile [codfw] DONE helmfile.d/services/developer-portal: apply [17:02:54] !log bd808@deploy1002 helmfile [eqiad] START helmfile.d/services/developer-portal: apply [17:03:28] !log bd808@deploy1002 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply [17:07:22] !log bking@cumin1001 END (PASS) - Cookbook 
sre.wdqs.data-transfer (exit_code=0) [17:11:50] huh, figured jinxer would rejoin automagically but it appears not configured to do so. probably will rejoin when an alert fires? [17:12:21] cwhite: it did that in #wikimedia-cloud-feed [17:12:34] good to know, thanks :) [17:12:42] !log cmooney@cumin1001 END (PASS) - Cookbook sre.network.provision (exit_code=0) for device lsw1-a2-codfw.mgmt.codfw.wmnet [17:16:43] (03PS1) 10Cwhite: alertmanager: add link to DatasourceError runbook [puppet] - 10https://gerrit.wikimedia.org/r/953495 (https://phabricator.wikimedia.org/T345358) [17:18:07] (03CR) 10Cwhite: "Any concerns about the overall message length?" [puppet] - 10https://gerrit.wikimedia.org/r/953495 (https://phabricator.wikimedia.org/T345358) (owner: 10Cwhite) [17:32:26] (03PS1) 10Herron: wip [puppet] - 10https://gerrit.wikimedia.org/r/954114 [17:43:53] (03CR) 10Andrew Bogott: [C: 03+1] "Adam -- this is prep work for the upcoming OpenStack upgrade." [puppet] - 10https://gerrit.wikimedia.org/r/951923 (https://phabricator.wikimedia.org/T341285) (owner: 10FNegri) [17:44:09] (03CR) 10Andrew Bogott: [C: 03+1] "Adam -- this is prep work for the upcoming OpenStack upgrade." [puppet] - 10https://gerrit.wikimedia.org/r/953252 (https://phabricator.wikimedia.org/T341285) (owner: 10FNegri) [18:00:05] jeena and dduvall: May I have your attention please! MediaWiki train - Utc-7 Version. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230831T1800) [18:04:32] (03PS1) 10TrainBranchBot: group2 wikis to 1.41.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/954117 (https://phabricator.wikimedia.org/T343726) [18:04:34] (03CR) 10TrainBranchBot: [C: 03+2] group2 wikis to 1.41.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/954117 (https://phabricator.wikimedia.org/T343726) (owner: 10TrainBranchBot) [18:05:18] (03Merged) 10jenkins-bot: group2 wikis to 1.41.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/954117 (https://phabricator.wikimedia.org/T343726) (owner: 10TrainBranchBot) [18:12:01] !log jhuneidi@deploy1002 rebuilt and synchronized wikiversions files: group2 wikis to 1.41.0-wmf.24 refs T343726 [18:12:07] T343726: 1.41.0-wmf.24 deployment blockers - https://phabricator.wikimedia.org/T343726 [18:20:01] (JobUnavailable) firing: Reduced availability for job nutcracker in ops@codfw- https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:39:55] (03PS1) 10Ryan Kemper: wdqs: use proper sparql endpoint [puppet] - 10https://gerrit.wikimedia.org/r/954119 (https://phabricator.wikimedia.org/T337296) [18:40:19] (03CR) 10Gehel: [C: 03+1] wdqs: use proper sparql endpoint [puppet] - 10https://gerrit.wikimedia.org/r/954119 (https://phabricator.wikimedia.org/T337296) (owner: 10Ryan Kemper) [18:40:26] (03CR) 10Bking: [C: 03+1] wdqs: use proper sparql endpoint [puppet] - 10https://gerrit.wikimedia.org/r/954119 (https://phabricator.wikimedia.org/T337296) (owner: 10Ryan Kemper) [18:40:32] (03CR) 10Ryan Kemper: [V: 03+2 C: 03+2] wdqs: use proper sparql endpoint [puppet] - 10https://gerrit.wikimedia.org/r/954119 (https://phabricator.wikimedia.org/T337296) (owner: 10Ryan Kemper) [18:44:40] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.restart [18:44:40] !log 
ryankemper@cumin1001 END (FAIL) - Cookbook sre.wdqs.restart (exit_code=99) [18:44:50] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.restart [18:44:50] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.wdqs.restart (exit_code=99) [18:46:07] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.restart [18:46:26] PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:46:27] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.restart [18:48:30] (03Abandoned) 10Arlolra: Use metrics from SiteConfig to restore the Parsoid prefix [extensions/VisualEditor] (wmf/1.41.0-wmf.23) - 10https://gerrit.wikimedia.org/r/954048 (https://phabricator.wikimedia.org/T339365) (owner: 10Arlolra) [18:49:18] RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:51:19] (03PS1) 10Cathal Mooney: Correct sysctl value for net.ipv4.tcp_min_snd_mss [puppet] - 10https://gerrit.wikimedia.org/r/954120 (https://phabricator.wikimedia.org/T344829) [18:54:54] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/952894 (https://phabricator.wikimedia.org/T345377) (owner: 10Herron) [18:56:55] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.wdqs.restart (exit_code=99) [18:57:32] (03CR) 10Subramanya Sastry: [C: 03+1] Use metrics from SiteConfig to restore the Parsoid prefix [extensions/VisualEditor] (wmf/1.41.0-wmf.24) - 10https://gerrit.wikimedia.org/r/954049 (https://phabricator.wikimedia.org/T339365) (owner: 10Arlolra) [19:03:21] !log T344198 Temporarily disabling puppet on all `wdqs*` hosts in preparation for `wdqs.discovery.wmnet` certificate revocation [19:03:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:03:33] T344198: Decommission wdqs100[3-5] - https://phabricator.wikimedia.org/T344198 [19:03:41] !log T344198 on `ryankemper@cumin1001`: `sudo -E cumin 'A:wdqs-all' 'sudo disable-puppet "revoking old cert and generating new one with new alt_names - T344198"'` [19:03:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:31] (03PS1) 10Bartosz Dziewoński: WatchlistManager: Do not require watchlist rights for clearing talk page notification [core] (wmf/1.41.0-wmf.24) - 10https://gerrit.wikimedia.org/r/953660 (https://phabricator.wikimedia.org/T345031) [19:07:41] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.restart (exit_code=0) [19:09:42] (SystemdUnitFailed) firing: (5) wcqs-updater.service Failed on wcqs1001:9100- https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:14:05] !log cmooney@cumin1001 START - Cookbook sre.network.provision for device lsw1-a3-codfw.mgmt.codfw.wmnet [19:14:07] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [19:14:42] (SystemdUnitFailed) resolved: (4) wcqs-updater.service Failed on wcqs1001:9100- https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:14:47] (03PS1) 10Ryan Kemper: wdqs: new wqds.discovery cert [puppet] - 10https://gerrit.wikimedia.org/r/954123 (https://phabricator.wikimedia.org/T344198) [19:16:03] (03CR) 10Bking: [C: 03+1] wdqs: new wqds.discovery cert [puppet] - 10https://gerrit.wikimedia.org/r/954123 (https://phabricator.wikimedia.org/T344198) (owner: 10Ryan Kemper) [19:17:23] (03CR) 10Ryan Kemper: [C: 03+2] wdqs: new wqds.discovery cert [puppet] - 10https://gerrit.wikimedia.org/r/954123 (https://phabricator.wikimedia.org/T344198) (owner: 10Ryan Kemper) [19:21:04] (03CR) 10Urbanecm: [C: 03+1] WatchlistManager: Do not require watchlist rights for clearing talk page notification [core] (wmf/1.41.0-wmf.24) - 10https://gerrit.wikimedia.org/r/953660 (https://phabricator.wikimedia.org/T345031) (owner: 10Bartosz Dziewoński) [19:28:33] !log ryankemper@cumin1001 START - Cookbook sre.hosts.decommission for hosts wdqs1005.eqiad.wmnet [19:30:18] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.restart [19:30:57] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for lsw1-a3-codfw - cmooney@cumin1001" [19:33:20] !log ryankemper@cumin1001 START - Cookbook sre.dns.netbox [19:37:43] (03PS7) 10Ebernhardson: Draft: cirrus streaming updater service [deployment-charts] - 10https://gerrit.wikimedia.org/r/951960 [19:37:45] (03CR) 10Ebernhardson: Draft: cirrus streaming updater service (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/951960 (owner: 10Ebernhardson) [19:41:06] (03CR) 10BBlack: [C: 03+1] Correct sysctl value for net.ipv4.tcp_min_snd_mss [puppet] - 10https://gerrit.wikimedia.org/r/954120 (https://phabricator.wikimedia.org/T344829) (owner: 10Cathal Mooney) [19:44:35] !log ryankemper@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate 
netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wdqs1005.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - ryankemper@cumin1001" [19:45:41] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host doh6002.wikimedia.org with OS bookworm [19:45:51] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host doh6002.wikimedia.org with OS bookworm [19:48:02] 10SRE-swift-storage, 10Commons, 10MediaWiki-Uploading, 10MW-1.41-notes (1.41.0-wmf.25; 2023-09-05), and 2 others: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872 (10Umherirrender) There is a (small) spike in grafana... [19:48:31] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.wdqs.restart (exit_code=99) [19:49:58] PROBLEM - BGP status on asw1-b13-drmrs.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv6: Connect - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:50:24] PROBLEM - BFD status on asw1-b13-drmrs.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:51:33] !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host restbase1030.eqiad.wmnet with OS bullseye [19:51:41] 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host restbase1030.eqiad.wmnet with OS bullseye [19:53:11] (03CR) 10Dr0ptp4kt: [openstack] remove deprecated option (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/953252 (https://phabricator.wikimedia.org/T341285) (owner: 10FNegri) [19:53:57] (JobUnavailable) firing: (2) Reduced availability for job 
nutcracker in ops@codfw- https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:59:56] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes2039.codfw.wmnet with OS bullseye [19:59:58] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes2038.codfw.wmnet with OS bullseye [20:00:04] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kubernetes2039.codfw.wmnet with OS bullseye [20:00:05] brennen and TheresNoTime: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC late backport and config training deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230831T2000). [20:00:05] arlolra, danisztls, and MatmaRex: A patch you scheduled for UTC late backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:00:07] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes2039.codfw.wmnet with OS bullseye [20:00:07] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes2038.codfw.wmnet with OS bullseye [20:00:08] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kubernetes2038.codfw.wmnet with OS bullseye [20:00:11] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes2037.codfw.wmnet with OS bullseye [20:00:14] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kubernetes2039.codfw.wmnet with OS bullseye executed with errors: - kubernetes... [20:00:18] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kubernetes2038.codfw.wmnet with OS bullseye executed with errors: - kubernetes... 
[20:00:24] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kubernetes2037.codfw.wmnet with OS bullseye [20:00:46] hi [20:01:12] I'm unable to deploy (cc brennen) [20:01:18] I can deploy [20:01:19] I can deploy [20:01:36] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes2038.codfw.wmnet with OS bullseye [20:01:44] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kubernetes2038.codfw.wmnet with OS bullseye [20:01:46] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes2038.codfw.wmnet with OS bullseye [20:01:47] jeena: we're doing deployment training if'n you're interested in joining :) [20:01:52] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kubernetes2038.codfw.wmnet with OS bullseye executed with errors: - kubernetes... 
[20:01:53] okay sure [20:02:10] (03CR) 10Dr0ptp4kt: New files/templates for OpenStack Antelope (2023.1) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/951923 (https://phabricator.wikimedia.org/T341285) (owner: 10FNegri) [20:03:11] (03CR) 10EoghanGaffney: [V: 03+1 C: 03+2] gitlab: Remove swift configs and return gitlab1003 to restore group [puppet] - 10https://gerrit.wikimedia.org/r/953193 (owner: 10EoghanGaffney) [20:05:36] MatmaRex: I can start with yours if you're ready [20:05:42] sure [20:06:02] arlolra: danisztls hi [20:06:07] hello [20:06:12] hi [20:06:32] I'll continue with your patches after doing MatmaRex's [20:06:40] ok [20:07:02] thanks [20:07:25] actually I'll do the config patches first, sorry about that [20:07:45] !log eevans@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host restbase1030.eqiad.wmnet with OS bullseye [20:07:52] 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host restbase1030.eqiad.wmnet with OS bullseye executed with errors: -... 
[20:07:59] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on doh6002.wikimedia.org with reason: host reimage [20:08:39] (03CR) 10Jeena Huneidi: [C: 03+2] WatchlistManager: Do not require watchlist rights for clearing talk page notification [core] (wmf/1.41.0-wmf.24) - 10https://gerrit.wikimedia.org/r/953660 (https://phabricator.wikimedia.org/T345031) (owner: 10Bartosz Dziewoński) [20:09:14] !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['restbase1030.eqiad.wmnet'] [20:09:36] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jhuneidi@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/950046 (https://phabricator.wikimedia.org/T336092) (owner: 10DDesouza) [20:09:38] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jhuneidi@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/954079 (https://phabricator.wikimedia.org/T345158) (owner: 10DDesouza) [20:10:18] (03Merged) 10jenkins-bot: Undeploy Research Incentive survey on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/950046 (https://phabricator.wikimedia.org/T336092) (owner: 10DDesouza) [20:10:20] (03CR) 10CI reject: [V: 04-1] Pre-deploy Campaigns Event Discovery survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/954079 (https://phabricator.wikimedia.org/T345158) (owner: 10DDesouza) [20:10:51] I will have to rebase my other change [20:10:54] thanks [20:11:04] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes2039.codfw.wmnet with OS bullseye [20:11:13] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kubernetes2039.codfw.wmnet with OS bullseye [20:11:14] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host 
kubernetes2039.codfw.wmnet with OS bullseye [20:11:21] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kubernetes2039.codfw.wmnet with OS bullseye executed with errors: - kubernetes... [20:11:34] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh6002.wikimedia.org with reason: host reimage [20:11:40] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wdqs1005.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - ryankemper@cumin1001" [20:11:40] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:11:40] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts wdqs1005.eqiad.wmnet [20:13:19] (03CR) 10Bking: [C: 03+2] Start Blazegraph from systemd unit, without runBlazegraph.sh [puppet] - 10https://gerrit.wikimedia.org/r/950136 (https://phabricator.wikimedia.org/T342361) (owner: 10Gehel) [20:13:29] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes2039.codfw.wmnet with OS bullseye [20:13:33] (03PS3) 10DDesouza: Pre-deploy Campaigns Event Discovery survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/954079 (https://phabricator.wikimedia.org/T345158) [20:13:37] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kubernetes2039.codfw.wmnet with OS bullseye [20:13:39] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes2039.codfw.wmnet with OS bullseye [20:13:46] 10SRE, 10ops-codfw, 10DC-Ops, 
10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kubernetes2039.codfw.wmnet with OS bullseye executed with errors: - kubernetes... [20:14:20] 10ops-eqiad, 10decommission-hardware: decommission wdqs100[3-5] - https://phabricator.wikimedia.org/T345391 (10RKemper) [20:14:39] 10ops-eqiad, 10decommission-hardware: decommission wdqs100[3-5] - https://phabricator.wikimedia.org/T345391 (10RKemper) [20:16:24] !log 'bking@wdqs1004 depool wdqs1004 to test script changes T342361' [20:16:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:16:30] T342361: Examine/refactor WDQS startup scripts - https://phabricator.wikimedia.org/T342361 [20:17:45] danisztls: you still want to deploy 954079 in this window, right? [20:17:57] !log eevans@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['restbase1030.eqiad.wmnet'] [20:18:05] !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['restbase1030.eqiad.wmnet'] [20:18:06] 10ops-eqiad, 10decommission-hardware: decommission wdqs1005 - https://phabricator.wikimedia.org/T345391 (10RKemper) [20:18:08] 10ops-eqiad, 10decommission-hardware: decommission wdqs1005 - https://phabricator.wikimedia.org/T345391 (10RKemper) [20:18:32] jeena: yes, if possible [20:18:41] already rebased it [20:18:43] ok, just making sure [20:18:47] great [20:19:01] sorry I didn't notice! 
[20:19:14] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jhuneidi@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/954079 (https://phabricator.wikimedia.org/T345158) (owner: 10DDesouza) [20:19:18] np [20:19:54] (03Merged) 10jenkins-bot: Pre-deploy Campaigns Event Discovery survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/954079 (https://phabricator.wikimedia.org/T345158) (owner: 10DDesouza) [20:20:09] !log jhuneidi@deploy1002 Started scap: Backport for [[gerrit:950046|Undeploy Research Incentive survey on enwiki (T336092)]], [[gerrit:954079|Pre-deploy Campaigns Event Discovery survey (T345158)]] [20:20:16] T336092: Deploy Research Incentive Survey on English Wikipedia - https://phabricator.wikimedia.org/T336092 [20:20:16] T345158: Deploy QuickSurvey for Campaigns Event Discovery project - https://phabricator.wikimedia.org/T345158 [20:20:41] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes2038.codfw.wmnet with OS bullseye [20:20:48] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kubernetes2038.codfw.wmnet with OS bullseye [20:20:50] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes2038.codfw.wmnet with OS bullseye [20:20:56] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kubernetes2038.codfw.wmnet with OS bullseye executed with errors: - kubernetes... 
[20:21:04] (03PS1) 10Ebernhardson: Provide zookeeper hosts in helmfile defaults [puppet] - 10https://gerrit.wikimedia.org/r/954126 [20:21:27] (03CR) 10CI reject: [V: 04-1] Provide zookeeper hosts in helmfile defaults [puppet] - 10https://gerrit.wikimedia.org/r/954126 (owner: 10Ebernhardson) [20:21:50] !log jhuneidi@deploy1002 jhuneidi and dani: Backport for [[gerrit:950046|Undeploy Research Incentive survey on enwiki (T336092)]], [[gerrit:954079|Pre-deploy Campaigns Event Discovery survey (T345158)]] synced to the testservers mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [20:22:21] danisztls: ready for you to do any checks on mwdebug before syncing [20:22:21] (03Merged) 10jenkins-bot: WatchlistManager: Do not require watchlist rights for clearing talk page notification [core] (wmf/1.41.0-wmf.24) - 10https://gerrit.wikimedia.org/r/953660 (https://phabricator.wikimedia.org/T345031) (owner: 10Bartosz Dziewoński) [20:23:07] (03PS1) 10Bking: Revert "Start Blazegraph from systemd unit, without runBlazegraph.sh" [puppet] - 10https://gerrit.wikimedia.org/r/953661 [20:23:12] jeena: i don't really want to test this in production with my IP address, i tested locally earlier though [20:23:37] jeena: first change looks good [20:23:42] (03CR) 10Bking: [C: 03+2] Revert "Start Blazegraph from systemd unit, without runBlazegraph.sh" [puppet] - 10https://gerrit.wikimedia.org/r/953661 (owner: 10Bking) [20:23:54] (03CR) 10Bking: [V: 03+2 C: 03+2] Revert "Start Blazegraph from systemd unit, without runBlazegraph.sh" [puppet] - 10https://gerrit.wikimedia.org/r/953661 (owner: 10Bking) [20:23:58] (JobUnavailable) firing: (2) Reduced availability for job nutcracker in ops@codfw- https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:23:59] MatmaRex: 👍 [20:24:45] (03PS2) 10Ebernhardson: Provide zookeeper hosts in helmfile defaults [puppet] - 10https://gerrit.wikimedia.org/r/954126 [20:25:14] (03CR) 10CI reject: [V: 04-1] Provide zookeeper hosts in helmfile defaults [puppet] - 10https://gerrit.wikimedia.org/r/954126 (owner: 10Ebernhardson) [20:25:40] !log eevans@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['restbase1030.eqiad.wmnet'] [20:26:07] danisztls: how about the second one? [20:26:25] (03PS3) 10Ebernhardson: Provide zookeeper hosts in helmfile defaults [puppet] - 10https://gerrit.wikimedia.org/r/954126 [20:26:27] !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['restbase1030.eqiad.wmnet'] [20:26:35] RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 83, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:26:40] !log eevans@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['restbase1030.eqiad.wmnet'] [20:27:06] !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['restbase1030.eqiad.wmnet'] [20:27:14] jeena: not [20:27:15] !log eevans@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['restbase1030.eqiad.wmnet'] [20:27:41] possible because messages haven't been created yet [20:27:50] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.restart [20:28:02] is it okay to sync? 
[20:28:03] !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['restbase1030.eqiad.wmnet'] [20:28:09] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [20:28:17] !log eevans@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['restbase1030.eqiad.wmnet'] [20:28:48] yeah, coverage is 0 [20:28:56] okay thanks [20:29:07] !log jhuneidi@deploy1002 jhuneidi and dani: Continuing with sync [20:29:41] (03PS4) 10Ebernhardson: Provide zookeeper hosts in helmfile defaults [puppet] - 10https://gerrit.wikimedia.org/r/954126 [20:32:11] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doh6002.wikimedia.org with OS bookworm [20:32:29] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host doh6002.wikimedia.org with OS bookworm completed: - doh6002 (**WARN**) - Downtimed on Icinga/Al... 
[20:34:29] !log jhuneidi@deploy1002 Finished scap: Backport for [[gerrit:950046|Undeploy Research Incentive survey on enwiki (T336092)]], [[gerrit:954079|Pre-deploy Campaigns Event Discovery survey (T345158)]] (duration: 14m 19s) [20:34:35] T336092: Deploy Research Incentive Survey on English Wikipedia - https://phabricator.wikimedia.org/T336092 [20:34:36] T345158: Deploy QuickSurvey for Campaigns Event Discovery project - https://phabricator.wikimedia.org/T345158 [20:34:43] (03PS1) 10Andrew Bogott: cinder backups: move paws to cloudbackup2002; backup life to 10 days [puppet] - 10https://gerrit.wikimedia.org/r/954130 [20:34:45] (03PS1) 10Andrew Bogott: wmcs-backup: support removal of unhandled image backups [puppet] - 10https://gerrit.wikimedia.org/r/954131 [20:35:06] danisztls: all synced [20:35:14] MatmaRex: starting yours now [20:35:27] jeena: thanks! [20:36:12] !log jhuneidi@deploy1002 Started scap: Backport for [[gerrit:953660|WatchlistManager: Do not require watchlist rights for clearing talk page notification (T345031)]] [20:36:18] T345031: New messages notification cannot be dismissed by unregistered users - https://phabricator.wikimedia.org/T345031 [20:36:21] (03CR) 10Jeena Huneidi: [C: 03+2] Use metrics from SiteConfig to restore the Parsoid prefix [extensions/VisualEditor] (wmf/1.41.0-wmf.24) - 10https://gerrit.wikimedia.org/r/954049 (https://phabricator.wikimedia.org/T339365) (owner: 10Arlolra) [20:37:38] !log jhuneidi@deploy1002 jhuneidi and matmarex: Backport for [[gerrit:953660|WatchlistManager: Do not require watchlist rights for clearing talk page notification (T345031)]] synced to the testservers mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [20:37:55] !log jhuneidi@deploy1002 jhuneidi and matmarex: Continuing with sync [20:39:21] (03PS1) 10Stevemunene: datahub: add oidc production settings [deployment-charts] - 
10https://gerrit.wikimedia.org/r/954132 (https://phabricator.wikimedia.org/T305874) [20:42:17] RECOVERY - BGP status on asw1-b13-drmrs.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [20:43:14] !log jhuneidi@deploy1002 Finished scap: Backport for [[gerrit:953660|WatchlistManager: Do not require watchlist rights for clearing talk page notification (T345031)]] (duration: 07m 01s) [20:43:19] !log bking@cumin1001 START - Cookbook sre.hosts.decommission for hosts flink-zk2001.codfw.wmnet [20:43:20] T345031: New messages notification cannot be dismissed by unregistered users - https://phabricator.wikimedia.org/T345031 [20:43:23] RECOVERY - BFD status on asw1-b13-drmrs.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:43:37] MatmaRex: synced [20:43:39] thanks jeena [20:43:48] (03CR) 10Ebernhardson: "PCC is failing for a real problem in the existing common.yaml. The problem is new zookeeper instances are being added and they have been d" [puppet] - 10https://gerrit.wikimedia.org/r/954126 (owner: 10Ebernhardson) [20:44:12] arlolra: still there? 
[20:44:17] yup [20:44:26] 😅 [20:44:31] 👍 starting yours now [20:44:37] thanks [20:45:05] I submitted it already to speed up a little [20:45:19] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jhuneidi@deploy1002 using scap backport" [extensions/VisualEditor] (wmf/1.41.0-wmf.24) - 10https://gerrit.wikimedia.org/r/954049 (https://phabricator.wikimedia.org/T339365) (owner: 10Arlolra) [20:45:42] !log brett@cumin2002 START - Cookbook sre.hosts.remove-downtime for doh6002.wikimedia.org [20:45:43] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for doh6002.wikimedia.org [20:46:31] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10BCornwall) [20:46:56] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host doh5002.wikimedia.org with OS bookworm [20:47:06] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host doh5002.wikimedia.org with OS bookworm [20:47:28] !log bking@cumin1001 START - Cookbook sre.dns.netbox [20:50:38] !log bking@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: flink-zk2001.codfw.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin1001" [20:50:59] PROBLEM - BFD status on cr3-eqsin is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:51:01] (03Merged) 10jenkins-bot: Use metrics from SiteConfig to restore the Parsoid prefix [extensions/VisualEditor] (wmf/1.41.0-wmf.24) - 10https://gerrit.wikimedia.org/r/954049 (https://phabricator.wikimedia.org/T339365) (owner: 10Arlolra) [20:51:14] !log jhuneidi@deploy1002 Started scap: Backport for [[gerrit:954049|Use metrics from SiteConfig to restore the Parsoid prefix (T339365)]] [20:51:20] T339365: Fix Parsoid 
metrics - https://phabricator.wikimedia.org/T339365 [20:51:44] !log bking@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: flink-zk2001.codfw.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin1001" [20:51:44] !log bking@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:51:45] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts flink-zk2001.codfw.wmnet [20:52:07] PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:52:38] !log jhuneidi@deploy1002 arlolra and jhuneidi: Backport for [[gerrit:954049|Use metrics from SiteConfig to restore the Parsoid prefix (T339365)]] synced to the testservers mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [20:53:02] arlolra: ready for you to check on mwdebug [20:53:10] alrighty [20:53:21] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:54:47] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:55:40] jeena: looks good [20:55:50] 👍 [20:55:55] !log jhuneidi@deploy1002 arlolra and jhuneidi: Continuing with sync [20:57:21] !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host restbase1030.eqiad.wmnet with OS bullseye [20:57:27] 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host 
restbase1030.eqiad.wmnet with OS bullseye [21:00:24] !log bking@cumin1001 START - Cookbook sre.hosts.decommission for hosts flink-zk2003.codfw.wmnet [21:01:18] !log jhuneidi@deploy1002 Finished scap: Backport for [[gerrit:954049|Use metrics from SiteConfig to restore the Parsoid prefix (T339365)]] (duration: 10m 03s) [21:01:25] T339365: Fix Parsoid metrics - https://phabricator.wikimedia.org/T339365 [21:01:52] thank you jeena [21:02:03] you're welcome! [21:02:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@eqiad- https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:02:55] 10SRE, 10Infrastructure-Foundations, 10Puppet-Core: Puppet certificate missing subjectAltName - https://phabricator.wikimedia.org/T158757 (10nshahquinn-wmf) >>! In T158757#9133055, @jbond wrote: > Its worth noting that once services have been migrated to the new puppet7 infrastructure then agent certificates... 
[21:04:42] !log bking@cumin1001 START - Cookbook sre.dns.netbox
[21:06:13] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.restart (exit_code=0)
[21:07:17] !log bking@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: flink-zk2003.codfw.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin1001"
[21:07:22] PROBLEM - Check systemd state on wdqs1016 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:07:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@eqiad- https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[21:07:47] (CR) Andrew Bogott: [C: +2] cinder backups: move paws to cloudbackup2002; backup life to 10 days [puppet] - https://gerrit.wikimedia.org/r/954130 (owner: Andrew Bogott)
[21:08:18] !log bking@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: flink-zk2003.codfw.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin1001"
[21:08:18] !log bking@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[21:08:19] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts flink-zk2003.codfw.wmnet
[21:09:50] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:10:45] (PS1) Bking: flink-zk: Move codfw hosts back to insetup [puppet] - https://gerrit.wikimedia.org/r/954134 (https://phabricator.wikimedia.org/T341792)
[21:11:00] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:13:29] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes2037.codfw.wmnet with OS bullseye
[21:13:36] SRE, ops-codfw, DC-Ops, serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kubernetes2037.codfw.wmnet with OS bullseye executed with errors: - kubernetes...
[21:25:27] !log eevans@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host restbase1030.eqiad.wmnet with OS bullseye
[21:25:34] SRE, Cassandra, Data-Persistence, Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host restbase1030.eqiad.wmnet with OS bullseye executed with errors: -...
[21:35:07] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on doh5002.wikimedia.org with reason: host reimage
[21:38:24] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh5002.wikimedia.org with reason: host reimage
[22:13:10] (PS2) Caenus: Deleting Ns:104 in itwikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/952817 (https://phabricator.wikimedia.org/T298315)
[22:15:23] RECOVERY - BFD status on cr2-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[22:15:47] RECOVERY - BFD status on cr3-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[22:17:05] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doh5002.wikimedia.org with OS bookworm
[22:17:16] SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host doh5002.wikimedia.org with OS bookworm completed: - doh5002 (**PASS**) - Downtimed on Icinga/Al...
[22:17:45] SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (BCornwall)
[22:58:59] PROBLEM - Ganeti memory on ganeti1019 is CRITICAL: CRIT Memory 95% used. Largest process: qemu-system-x86 (278306) = 12.7% https://wikitech.wikimedia.org/wiki/Ganeti%23Memory_pressure
[23:04:42] (SystemdUnitFailed) firing: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service Failed on wdqs1016:9100- https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:10:54] !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host restbase1030.eqiad.wmnet with OS bullseye
[23:11:01] SRE, Cassandra, Data-Persistence, Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host restbase1030.eqiad.wmnet with OS bullseye
[23:15:15] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.restart
[23:15:15] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.wdqs.restart (exit_code=99)
[23:16:26] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.restart
[23:20:47] (CR) Cwhite: "Thanks for putting this together!" [puppet] - https://gerrit.wikimedia.org/r/952894 (https://phabricator.wikimedia.org/T345377) (owner: Herron)
[23:25:35] (PS1) Andrea Denisse: librenms: Add PHP version for Debian Bookworm hosts [puppet] - https://gerrit.wikimedia.org/r/954143 (https://phabricator.wikimedia.org/T344136)
[23:46:10] (PS3) Tim Starling: Raise LoginNotify minimum log level to info [mediawiki-config] - https://gerrit.wikimedia.org/r/952564 (https://phabricator.wikimedia.org/T174200)
[23:53:08] !log eevans@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host restbase1030.eqiad.wmnet with OS bullseye
[23:53:15] SRE, Cassandra, Data-Persistence, Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host restbase1030.eqiad.wmnet with OS bullseye executed with errors: -...
[23:54:40] (CR) Tim Starling: [C: +2] Raise LoginNotify minimum log level to info [mediawiki-config] - https://gerrit.wikimedia.org/r/952564 (https://phabricator.wikimedia.org/T174200) (owner: Tim Starling)
[23:55:22] (Merged) jenkins-bot: Raise LoginNotify minimum log level to info [mediawiki-config] - https://gerrit.wikimedia.org/r/952564 (https://phabricator.wikimedia.org/T174200) (owner: Tim Starling)