[00:00:40] (CR) Jbond: "this will need rebasing after the patches for the following patchsets" [puppet] - https://gerrit.wikimedia.org/r/768762 (owner: Jbond) [00:03:03] RECOVERY - Check systemd state on thanos-be1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:07:30] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [00:07:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:08:45] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [00:08:45] PROBLEM - Check systemd state on grafana1002 is CRITICAL: CRITICAL - degraded: The following units failed: grafana-ldap-users-sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:08:46] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [00:08:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:08:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:09:30] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [00:09:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:16:43] RECOVERY - Check systemd state on grafana1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:19:31] (PS12) Jbond: sre.puppet.sync-netbox-hiera: Cookbook for syncing netbox puppet data [cookbooks] - https://gerrit.wikimedia.org/r/739234 (https://phabricator.wikimedia.org/T229397) [00:22:20] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudvirt1047.eqiad.wmnet - https://phabricator.wikimedia.org/T293391 (Papaul) @Andrew sorry updating the task now. Here is my finding on 1047 . I tried to enable pxe boot on both 10G ports and did a mac... 
[00:22:41] (CR) jerkins-bot: [V: -1] sre.puppet.sync-netbox-hiera: Cookbook for syncing netbox puppet data [cookbooks] - https://gerrit.wikimedia.org/r/739234 (https://phabricator.wikimedia.org/T229397) (owner: Jbond) [00:26:00] !log ebysans@deploy1002 Started deploy [airflow-dags/analytics@7975c27]: (no justification provided) [00:26:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:26:08] !log ebysans@deploy1002 Finished deploy [airflow-dags/analytics@7975c27]: (no justification provided) (duration: 00m 08s) [00:26:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:38:41] (PS1) Eevans: Create Cassandra role for Image Suggestions dataset [puppet] - https://gerrit.wikimedia.org/r/769587 (https://phabricator.wikimedia.org/T295405) [00:41:33] (CR) Eevans: [C: -1] "-1 for now because this still needs a password added for the `image_suggestions` role/user (and I don't have access to that repo)." [puppet] - https://gerrit.wikimedia.org/r/769587 (https://phabricator.wikimedia.org/T295405) (owner: Eevans) [01:00:04] twentyafterfour: #bothumor My software never has bugs. It just develops random features. Rise for Phabricator update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220310T0100). 
[01:20:56] (JobUnavailable) firing: Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [01:39:41] RECOVERY - SSH on aqs1009.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:40:30] (JobUnavailable) firing: (2) Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [05:26:41] PROBLEM - SSH on analytics1067.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:37:04] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1139.eqiad.wmnet with reason: Maintenance [05:37:05] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1139.eqiad.wmnet with reason: Maintenance [05:37:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:37:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:39:44] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1170.eqiad.wmnet with reason: Maintenance [05:39:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:39:45] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1170.eqiad.wmnet with reason: Maintenance [05:39:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:39:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3317 (T300775)', diff saved to https://phabricator.wikimedia.org/P22232 and previous config saved to 
/var/cache/conftool/dbconfig/20220310-053950-marostegui.json [05:39:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:39:53] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775 [05:40:56] (JobUnavailable) firing: Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [05:42:03] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1133.eqiad.wmnet with reason: Maintenance [05:42:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:42:05] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1133.eqiad.wmnet with reason: Maintenance [05:42:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:45:51] !log dbmaint on s2@eqiad T272512 [05:45:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:45:55] T272512: Apply outstanding schema changes for "objectcache" tables in production (exptime, flags, modtoken) - https://phabricator.wikimedia.org/T272512 [05:45:56] !log dbmaint on pc1@eqiad T272512 [05:45:57] !log dbmaint on pc2@eqiad T272512 [05:45:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:45:59] !log dbmaint on pc3@eqiad T272512 [05:46:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:46:02] !log dbmaint on s4@eqiad T272512 [05:46:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:46:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:46:13] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 235, down: 1, dormant: 0, excluded: 0, unused: 0: 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:46:46] !log dbmaint on s5@eqiad T272512 [05:46:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:46:53] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 133, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:46:55] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1099.eqiad.wmnet with reason: Maintenance [05:46:56] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1099.eqiad.wmnet with reason: Maintenance [05:46:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:46:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:47:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1099:3311 (T298294)', diff saved to https://phabricator.wikimedia.org/P22233 and previous config saved to /var/cache/conftool/dbconfig/20220310-054701-marostegui.json [05:47:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:47:05] T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294 [05:53:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311 (T298294)', diff saved to https://phabricator.wikimedia.org/P22234 and previous config saved to /var/cache/conftool/dbconfig/20220310-055335-marostegui.json [05:53:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:53:40] T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294 [06:02:43] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 236, down: 0, dormant: 0, excluded: 0, unused: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:03:07] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 134, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:04:15] !log dbmaint on s1@eqiad T272512 [06:05:16] !log dbmaint on s7@eqiad T272512 [06:05:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:05:19] T272512: Apply outstanding schema changes for "objectcache" tables in production (exptime, flags, modtoken) - https://phabricator.wikimedia.org/T272512 [06:07:40] !log dbmaint on s3@eqiad T272512 [06:07:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:08:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', diff saved to https://phabricator.wikimedia.org/P22235 and previous config saved to /var/cache/conftool/dbconfig/20220310-060840-marostegui.json [06:08:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:13:43] (PS1) Marostegui: mariadb: Remove innodb_buffer_pool_instances [puppet] - https://gerrit.wikimedia.org/r/769610 (https://phabricator.wikimedia.org/T301879) [06:15:43] (CR) Marostegui: [C: +2] mariadb: Remove innodb_buffer_pool_instances [puppet] - https://gerrit.wikimedia.org/r/769610 (https://phabricator.wikimedia.org/T301879) (owner: Marostegui) [06:21:55] (PS1) Marostegui: db1132: Move it from m5 to s1 [puppet] - https://gerrit.wikimedia.org/r/769612 (https://phabricator.wikimedia.org/T303395) [06:22:29] PROBLEM - SSH on bast5002 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [06:23:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', diff saved to https://phabricator.wikimedia.org/P22236 and previous config saved to 
/var/cache/conftool/dbconfig/20220310-062345-marostegui.json [06:23:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:25:11] RECOVERY - SSH on bast5002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [06:25:16] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1171.eqiad.wmnet with reason: Maintenance [06:25:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:25:17] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1171.eqiad.wmnet with reason: Maintenance [06:25:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:25:55] PROBLEM - Check systemd state on doc1001 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-doc-doc1002.eqiad.wmnet.service,rsync-doc-doc2001.codfw.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:28:37] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1171.eqiad.wmnet with reason: Maintenance [06:28:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:28:38] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1171.eqiad.wmnet with reason: Maintenance [06:28:40] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1116.eqiad.wmnet with reason: Maintenance [06:28:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:28:41] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1116.eqiad.wmnet with reason: Maintenance [06:28:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:28:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:29:26] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime 
for 12:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [06:29:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:29:27] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [06:29:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:30:11] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1099.eqiad.wmnet with reason: Maintenance [06:30:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:30:12] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1099.eqiad.wmnet with reason: Maintenance [06:30:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:30:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1099:3318 (T300775)', diff saved to https://phabricator.wikimedia.org/P22237 and previous config saved to /var/cache/conftool/dbconfig/20220310-063017-marostegui.json [06:30:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:30:20] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775 [06:32:37] (CR) Marostegui: [C: +2] db1132: Move it from m5 to s1 [puppet] - https://gerrit.wikimedia.org/r/769612 (https://phabricator.wikimedia.org/T303395) (owner: Marostegui) [06:33:47] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db1132.eqiad.wmnet with OS bullseye [06:33:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:35:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3318 (T300775)', diff saved to https://phabricator.wikimedia.org/P22238 and previous config saved to /var/cache/conftool/dbconfig/20220310-063503-marostegui.json [06:35:06] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:38:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311 (T298294)', diff saved to https://phabricator.wikimedia.org/P22239 and previous config saved to /var/cache/conftool/dbconfig/20220310-063850-marostegui.json [06:38:52] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1105.eqiad.wmnet with reason: Maintenance [06:38:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:38:54] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1105.eqiad.wmnet with reason: Maintenance [06:38:55] T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294 [06:38:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:38:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:38:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3311 (T298294)', diff saved to https://phabricator.wikimedia.org/P22240 and previous config saved to /var/cache/conftool/dbconfig/20220310-063858-marostegui.json [06:39:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:43:16] PROBLEM - MariaDB Replica SQL: db_inventory #page on db2093 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1665, Errmsg: Error Cannot execute statement: impossible to write to binary log since BINLOG_FORMAT = STATEMENT and at least one table uses a storage engine limited to row-based logging. InnoDB is limited to row-logging when transaction isolation level is READ COMMITTED or READ UNCOMMITTED. on query. Default data [06:43:16] rcillo. 
[Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [06:43:26] ^ fixing [06:43:52] marostegui: <3 [06:44:09] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1132.eqiad.wmnet with reason: host reimage [06:44:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:44:48] <_joe_> rzl: why are you even here :D [06:45:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311 (T298294)', diff saved to https://phabricator.wikimedia.org/P22241 and previous config saved to /var/cache/conftool/dbconfig/20220310-064506-marostegui.json [06:45:08] marostegui: let us know if you need a hand [06:45:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:45:10] T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294 [06:45:17] volans: thank you! it is now fixed [06:45:23] thx [06:45:33] <_joe_> marostegui: thanks [06:46:06] RECOVERY - MariaDB Replica SQL: db_inventory #page on db2093 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [06:46:50] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1132.eqiad.wmnet with reason: host reimage [06:46:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:47:21] _joe_: pager goes off, I take a look! 
heading to bed though 👋 [06:50:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3318', diff saved to https://phabricator.wikimedia.org/P22242 and previous config saved to /var/cache/conftool/dbconfig/20220310-065009-marostegui.json [06:50:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:59:31] <_joe_> rzl: yeah I assumed people in PT never got paged after 6 am my time, sorry [07:00:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311', diff saved to https://phabricator.wikimedia.org/P22243 and previous config saved to /var/cache/conftool/dbconfig/20220310-070011-marostegui.json [07:00:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:01:31] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1132.eqiad.wmnet with OS bullseye [07:01:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:03:50] (PS1) Marostegui: db_inventory.pp: Remove binlog override [puppet] - https://gerrit.wikimedia.org/r/769615 (https://phabricator.wikimedia.org/T303496) [07:04:21] (CR) jerkins-bot: [V: -1] db_inventory.pp: Remove binlog override [puppet] - https://gerrit.wikimedia.org/r/769615 (https://phabricator.wikimedia.org/T303496) (owner: Marostegui) [07:05:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3318', diff saved to https://phabricator.wikimedia.org/P22244 and previous config saved to /var/cache/conftool/dbconfig/20220310-070514-marostegui.json [07:05:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:06:28] (CR) Marostegui: "PCC looks good: https://puppet-compiler.wmflabs.org/pcc-worker1001/34183/" [puppet] - https://gerrit.wikimedia.org/r/769615 (https://phabricator.wikimedia.org/T303496) (owner: Marostegui) [07:06:30] (PS2) Marostegui: db_inventory.pp: Remove binlog 
override [puppet] - https://gerrit.wikimedia.org/r/769615 (https://phabricator.wikimedia.org/T303496) [07:07:04] (CR) jerkins-bot: [V: -1] db_inventory.pp: Remove binlog override [puppet] - https://gerrit.wikimedia.org/r/769615 (https://phabricator.wikimedia.org/T303496) (owner: Marostegui) [07:10:29] (PS3) Marostegui: db_inventory.pp: Remove binlog override [puppet] - https://gerrit.wikimedia.org/r/769615 (https://phabricator.wikimedia.org/T303496) [07:10:55] (PS1) Elukey: Set bullseye + overlayfs settings for kubernetes2009 [puppet] - https://gerrit.wikimedia.org/r/769616 (https://phabricator.wikimedia.org/T300744) [07:10:57] (PS1) Elukey: Set bullseye + overlayfs settings for kubernetes2010 [puppet] - https://gerrit.wikimedia.org/r/769617 (https://phabricator.wikimedia.org/T300744) [07:15:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311', diff saved to https://phabricator.wikimedia.org/P22245 and previous config saved to /var/cache/conftool/dbconfig/20220310-071516-marostegui.json [07:15:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:18:35] (CR) Marostegui: [C: +2] db_inventory.pp: Remove binlog override [puppet] - https://gerrit.wikimedia.org/r/769615 (https://phabricator.wikimedia.org/T303496) (owner: Marostegui) [07:20:13] RECOVERY - Check systemd state on doc1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:20:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3318 (T300775)', diff saved to https://phabricator.wikimedia.org/P22246 and previous config saved to /var/cache/conftool/dbconfig/20220310-072019-marostegui.json [07:20:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:20:23] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775 [07:20:24] !log 
marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2079.codfw.wmnet with reason: Maintenance [07:20:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:20:26] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2079.codfw.wmnet with reason: Maintenance [07:20:27] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 12 hosts with reason: Maintenance [07:20:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:20:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:20:35] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 12 hosts with reason: Maintenance [07:20:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:21:18] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1172.eqiad.wmnet with reason: Maintenance [07:21:19] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1172.eqiad.wmnet with reason: Maintenance [07:21:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:21:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:21:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1172 (T300775)', diff saved to https://phabricator.wikimedia.org/P22247 and previous config saved to /var/cache/conftool/dbconfig/20220310-072124-marostegui.json [07:21:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:28:32] RECOVERY - SSH on analytics1067.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:30:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311 (T298294)', diff saved to 
https://phabricator.wikimedia.org/P22248 and previous config saved to /var/cache/conftool/dbconfig/20220310-073022-marostegui.json [07:30:23] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1140.eqiad.wmnet with reason: Maintenance [07:30:25] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1140.eqiad.wmnet with reason: Maintenance [07:30:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:30:27] T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294 [07:30:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:30:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:35:17] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1163.eqiad.wmnet with reason: Maintenance [07:35:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:35:19] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1163.eqiad.wmnet with reason: Maintenance [07:35:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:35:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1163 (T298294)', diff saved to https://phabricator.wikimedia.org/P22249 and previous config saved to /var/cache/conftool/dbconfig/20220310-073523-marostegui.json [07:35:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:35:27] T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294 [07:37:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T300775)', diff saved to https://phabricator.wikimedia.org/P22250 and previous config saved to /var/cache/conftool/dbconfig/20220310-073708-marostegui.json [07:37:11] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:37:12] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775 [07:41:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163 (T298294)', diff saved to https://phabricator.wikimedia.org/P22251 and previous config saved to /var/cache/conftool/dbconfig/20220310-074118-marostegui.json [07:41:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:41:22] T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294 [07:43:52] !log Reboot dbproxy2001, 2002, 2003, 2004 T303174 [07:43:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:52:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P22252 and previous config saved to /var/cache/conftool/dbconfig/20220310-075213-marostegui.json [07:52:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:56:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163', diff saved to https://phabricator.wikimedia.org/P22253 and previous config saved to /var/cache/conftool/dbconfig/20220310-075623-marostegui.json [07:56:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:00:04] Amir1 and apergos: Your horoscope predicts another unfortunate UTC morning backport and config training deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220310T0800). [08:00:04] kart_: A patch you scheduled for UTC morning backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [08:00:18] no trainees today! 
[08:00:24] !log Reboot dbproxy1012, 1015, 1016 T303174 [08:00:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:00:27] one patch only for the window [08:00:51] patch seems fine to me [08:01:17] kart or kart_ is not here however, hopefully they will be here soon [08:02:18] ah kart_ there you are [08:02:23] Aha. Forgot about Backport deployment window :) [08:02:28] you're the only patch owner in the window [08:02:28] apergos: yep :) [08:02:31] no trainees today [08:02:36] self-deploying? [08:02:42] OK. Self deploy :) [08:02:50] ok! (I always forget so I always have to ask) [08:02:57] go for it! [08:03:08] !log Reboot dbproxy1017 1016 T303174 [08:03:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:03:39] (PS2) KartikMistry: Enable SectionTranslation on Javanese, Tagalog, Mongolian, Telugu WPs [mediawiki-config] - https://gerrit.wikimedia.org/r/769386 (https://phabricator.wikimedia.org/T298237) [08:05:29] (CR) KartikMistry: [C: +2] Enable SectionTranslation on Javanese, Tagalog, Mongolian, Telugu WPs [mediawiki-config] - https://gerrit.wikimedia.org/r/769386 (https://phabricator.wikimedia.org/T298237) (owner: KartikMistry) [08:06:11] (Merged) jenkins-bot: Enable SectionTranslation on Javanese, Tagalog, Mongolian, Telugu WPs [mediawiki-config] - https://gerrit.wikimedia.org/r/769386 (https://phabricator.wikimedia.org/T298237) (owner: KartikMistry) [08:06:29] (PS1) Marostegui: wmnet: Failover m1-master [dns] - https://gerrit.wikimedia.org/r/769654 (https://phabricator.wikimedia.org/T303174) [08:07:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P22254 and previous config saved to /var/cache/conftool/dbconfig/20220310-080718-marostegui.json [08:07:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:10:08] !log mwdebug-deploy@deploy1002 helmfile [eqiad] 
START helmfile.d/services/mwdebug: apply [08:10:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:10:26] (PS2) Muehlenhoff: Enable profile::auto_restarts::service for klaxon gunicorn webapp [puppet] - https://gerrit.wikimedia.org/r/767516 (https://phabricator.wikimedia.org/T135991) [08:10:32] SRE, SRE-OnFire (FY2021/2022-Q3), Infrastructure-Foundations, SRE Observability (FY2021/2022-Q3): Implement an accurate and easy to understand status page for all wikis - https://phabricator.wikimedia.org/T202061 (Abbe98) Is there any public documentation regarding the selection of the statuspage... [08:10:49] (CR) Muehlenhoff: Enable profile::auto_restarts::service for klaxon gunicorn webapp (1 comment) [puppet] - https://gerrit.wikimedia.org/r/767516 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff) [08:11:09] (PS1) Marostegui: db1099: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/769655 (https://phabricator.wikimedia.org/T303395) [08:11:21] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [08:11:22] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [08:11:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:11:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:11:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163', diff saved to https://phabricator.wikimedia.org/P22255 and previous config saved to /var/cache/conftool/dbconfig/20220310-081129-marostegui.json [08:11:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:12:14] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [08:12:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:12:27] (CR) Giuseppe Lavagetto: [C: +1] Set bullseye + overlayfs 
settings for kubernetes2009 [puppet] - 10https://gerrit.wikimedia.org/r/769616 (https://phabricator.wikimedia.org/T300744) (owner: 10Elukey) [08:12:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1099 (s1, s8) for reboot', diff saved to https://phabricator.wikimedia.org/P22256 and previous config saved to /var/cache/conftool/dbconfig/20220310-081244-marostegui.json [08:12:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:13:01] (03PS3) 10Muehlenhoff: Add Cumin alias for datahubsearch [puppet] - 10https://gerrit.wikimedia.org/r/767709 [08:13:03] (03CR) 10Marostegui: [C: 03+2] db1099: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/769655 (https://phabricator.wikimedia.org/T303395) (owner: 10Marostegui) [08:14:01] I'm still testing on mwdebug.. [08:15:35] ok! [08:17:15] (03PS1) 10KartikMistry: SectionTranslation: Also add languages to target [mediawiki-config] - 10https://gerrit.wikimedia.org/r/769656 (https://phabricator.wikimedia.org/T298237) [08:17:16] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [08:17:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:17:18] apergos: I need to deploy other patch also to fix target languages. [08:17:41] put it on the calendar real fast then and let's get it in [08:18:14] OK! 
[08:18:31] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [08:18:32] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [08:18:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:18:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:19:11] !log kartik@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: Part 1: [[gerrit:769386|Enable SectionTranslation on Javanese, Tagalog, Mongolian, Telugu WPs (T298237)]] (duration: 00m 50s) [08:19:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:19:14] T298237: Enable Section Translation on Javanese, Tagalog, Mongolian and Telugu Wikipedias - https://phabricator.wikimedia.org/T298237 [08:19:16] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [08:19:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:19:28] (03CR) 10Volans: "minor nits inline, looks good otherwise" [puppet] - 10https://gerrit.wikimedia.org/r/769572 (owner: 10Jbond) [08:20:13] ok [08:20:14] apergos: Added. Will start deploy after +2 of the patch. 
[08:20:15] (03CR) 10Muehlenhoff: [C: 03+2] Add Cumin alias for datahubsearch [puppet] - 10https://gerrit.wikimedia.org/r/767709 (owner: 10Muehlenhoff) [08:20:17] right [08:20:45] (03CR) 10KartikMistry: [C: 03+2] SectionTranslation: Also add languages to target [mediawiki-config] - 10https://gerrit.wikimedia.org/r/769656 (https://phabricator.wikimedia.org/T298237) (owner: 10KartikMistry) [08:21:27] (03Merged) 10jenkins-bot: SectionTranslation: Also add languages to target [mediawiki-config] - 10https://gerrit.wikimedia.org/r/769656 (https://phabricator.wikimedia.org/T298237) (owner: 10KartikMistry) [08:22:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T300775)', diff saved to https://phabricator.wikimedia.org/P22258 and previous config saved to /var/cache/conftool/dbconfig/20220310-082223-marostegui.json [08:22:25] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1178.eqiad.wmnet with reason: Maintenance [08:22:26] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1178.eqiad.wmnet with reason: Maintenance [08:22:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:22:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1099:3318 (re)pooling @ 10%: After reboot5', diff saved to https://phabricator.wikimedia.org/P22259 and previous config saved to /var/cache/conftool/dbconfig/20220310-082227-root.json [08:22:31] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775 [08:22:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:22:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:22:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1178 (T300775)', diff saved to https://phabricator.wikimedia.org/P22260 and previous config saved to /var/cache/conftool/dbconfig/20220310-082234-marostegui.json [08:22:36] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:22:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:24:18] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [08:24:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:24:31] !log kartik@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: Part 2: [[gerrit:769656|SectionTranslation: Also add languages to target (T298237)]] (duration: 00m 49s) [08:24:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:24:34] T298237: Enable Section Translation on Javanese, Tagalog, Mongolian and Telugu Wikipedias - https://phabricator.wikimedia.org/T298237 [08:25:05] (03CR) 10Volans: "some questions inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/769109 (https://phabricator.wikimedia.org/T301955) (owner: 10Ryan Kemper) [08:25:24] apergos: all done. [08:25:31] nice! [08:25:33] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [08:25:34] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [08:25:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:25:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:26:28] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [08:26:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:26:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163 (T298294)', diff saved to https://phabricator.wikimedia.org/P22261 and previous config saved to /var/cache/conftool/dbconfig/20220310-082634-marostegui.json [08:26:36] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1134.eqiad.wmnet with reason: Maintenance [08:26:37] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:26:37] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1134.eqiad.wmnet with reason: Maintenance [08:26:38] last call for anyone wanting to sneak in a patch, there's probably time for a couple more config patches [08:26:38] T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294 [08:26:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:26:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:26:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1134 (T298294)', diff saved to https://phabricator.wikimedia.org/P22262 and previous config saved to /var/cache/conftool/dbconfig/20220310-082642-marostegui.json [08:26:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:26:54] enwiki says: upstream connect error or disconnect/reset before headers. 
reset reason: overflow [08:27:05] PROBLEM - Not enough idle PHP-FPM workers for Mediawiki api_appserver at eqiad #page on alert1001 is CRITICAL: 0.04972 lt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver [08:27:17] yo [08:27:21] * volans here [08:27:26] <_joe_> si [08:27:28] <_joe_> gh [08:27:34] it seems to be db1099:3318 [08:27:36] depooling it [08:27:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1099:3318', diff saved to https://phabricator.wikimedia.org/P22263 and previous config saved to /var/cache/conftool/dbconfig/20220310-082737-marostegui.json [08:27:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:28:14] PROBLEM - Apache HTTP on mw1354 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [08:28:22] PROBLEM - Apache HTTP on mw1366 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [08:28:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [08:29:07] PROBLEM - Not enough idle PHP-FPM workers for Mediawiki appserver at eqiad #page on alert1001 is CRITICAL: 0.04737 lt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver [08:29:08] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [08:29:32] PROBLEM - High average POST latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [08:29:32] PROBLEM - phpfpm_up reduced availability on alert1001 is CRITICAL: 0.7464 le 0.8 https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_exporters_%22up%22_metrics_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:29:35] So maybe it wasn't db1099:3318 then... 
[08:29:49] <_joe_> marostegui: I doubt it is [08:29:50] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.4677 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [08:29:52] yeah I was about to say that I am not sure it's a db that's to blame [08:30:00] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code={200,204} handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [08:30:04] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/suggest/sections/{title}/{from}/{to} (Suggest source sections to translate) is CRITICAL: Test Suggest source sections to translate returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/CX [08:30:09] akosiaris: yeah, i repooled it after an upgrade so I thought it could be it, but it is clearly not [08:30:31] idle php-fpm workers are increasing again, we should recover shortly [08:30:32] kart_: not liking the cxserver problem alert I see ^^ [08:30:49] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) firing: (2) Processing latency of WCQS_Streaming_Updater in eqiad (k8s) is above 5 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org [08:30:54] RECOVERY - Apache HTTP on mw1354 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.050 second response time https://wikitech.wikimedia.org/wiki/Application_servers [08:31:00] RECOVERY - Apache HTTP on mw1366 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 
0.048 second response time https://wikitech.wikimedia.org/wiki/Application_servers [08:31:02] yeah, recoveries incoming [08:31:18] apergos: I think cx alerting (among other stuff) is a result of mediawiki having problems [08:31:22] <_joe_> yes [08:31:24] look at the URL [08:31:25] <_joe_> the api is slow [08:31:30] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [08:31:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:31:36] <_joe_> so cxserver and everything else is a problem [08:31:39] it's /v2/suggest/sections/{title}, so it's trying to fetch something from mw [08:31:39] might be, let's see if it recovers with the rest [08:31:43] <_joe_> uh who did do a deployment? [08:31:48] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 543 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [08:31:53] RECOVERY - Not enough idle PHP-FPM workers for Mediawiki appserver at eqiad #page on alert1001 is OK: (C)0.3 lt (W)0.5 lt 0.8157 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver [08:31:54] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [08:32:01] that would be kart_ with section translation config change, that's all [08:32:06] <_joe_> yes [08:32:18] RECOVERY - High average POST latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [08:32:18] RECOVERY - phpfpm_up reduced availability on alert1001 is OK: (C)0.8 le (W)0.9 le 1 https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_exporters_%22up%22_metrics_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:32:38] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is OK: (C)0.3 gt (W)0.1 gt 0.06452 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [08:32:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T298294)', diff saved to https://phabricator.wikimedia.org/P22264 and previous config saved to /var/cache/conftool/dbconfig/20220310-083244-marostegui.json [08:32:45] RECOVERY - Not enough idle PHP-FPM workers for Mediawiki api_appserver at eqiad #page on alert1001 is OK: (C)0.3 lt (W)0.5 lt 0.805 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver [08:32:45] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE 
helmfile.d/services/mwdebug: apply [08:32:46] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [08:32:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:32:48] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [08:32:48] T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294 [08:32:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:32:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:33:33] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [08:33:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:33:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [08:34:40] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [08:35:49] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) firing: (3) Processing latency of WCQS_Streaming_Updater in codfw (k8s) is above 5 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org [08:37:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1099:3318 (re)pooling @ 25%: After reboot5', diff saved to https://phabricator.wikimedia.org/P22265 and previous config saved to /var/cache/conftool/dbconfig/20220310-083732-root.json [08:37:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:38:34] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:39:16] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) is CRITICAL: Test Machine translate an HTML fragment using TestClient, adapt the links to target language wiki. 
returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/CX [08:40:49] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) resolved: (3) Processing latency of WCQS_Streaming_Updater in codfw (k8s) is above 5 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org [08:41:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1099:3318', diff saved to https://phabricator.wikimedia.org/P22266 and previous config saved to /var/cache/conftool/dbconfig/20220310-084139-marostegui.json [08:41:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:42:04] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [08:42:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T300775)', diff saved to https://phabricator.wikimedia.org/P22267 and previous config saved to /var/cache/conftool/dbconfig/20220310-084219-marostegui.json [08:42:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:42:23] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775 [08:43:08] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 101 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [08:43:13] PROBLEM - Not enough idle PHP-FPM workers for Mediawiki appserver at eqiad #page on alert1001 is CRITICAL: 0.1682 lt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver [08:43:27] well I'm calling the window 
closed since we had no more takers for another patch and we have an ongoing incident [08:43:54] !log UTC morning backport and config window completed [08:43:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:44:06] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code={200,204} handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [08:44:12] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [08:46:03] RECOVERY - Not enough idle PHP-FPM workers for Mediawiki appserver at eqiad #page on alert1001 is OK: (C)0.3 lt (W)0.5 lt 0.8044 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver [08:46:34] Ouch, but that doesn't seem related to the config change for sure. [08:46:56] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [08:47:04] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:47:36] (03PS1) 10Ayounsi: labs-in filter: remove PXE term [homer/public] - 10https://gerrit.wikimedia.org/r/769657 [08:47:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P22268 and previous config saved to /var/cache/conftool/dbconfig/20220310-084749-marostegui.json [08:47:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:48] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [08:50:08] @akosiaris Recovery on? [08:52:20] It's flapping I'd say [08:52:36] Still not clear what's going on [08:57:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P22269 and previous config saved to /var/cache/conftool/dbconfig/20220310-085724-marostegui.json [08:57:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:01:04] PROBLEM - SSH on db2090.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:01:35] (03CR) 10Ayounsi: [C: 03+1] "Reviewed and tested locally, LGTM!" 
[homer/public] - 10https://gerrit.wikimedia.org/r/769478 (https://phabricator.wikimedia.org/T299758) (owner: 10Cathal Mooney) [09:02:09] kart_: you can resume deploying (unless you're already done), things are under control [09:02:40] moritzm: Already finished before outage :) [09:16:19] !log failover ganeti master for drmrs/B12 to ganeti6003 [09:16:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:18:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T298294)', diff saved to https://phabricator.wikimedia.org/P22272 and previous config saved to /var/cache/conftool/dbconfig/20220310-091759-marostegui.json [09:18:01] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1135.eqiad.wmnet with reason: Maintenance [09:18:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:18:03] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1135.eqiad.wmnet with reason: Maintenance [09:18:03] T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294 [09:18:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:18:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:18:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1135 (T298294)', diff saved to https://phabricator.wikimedia.org/P22273 and previous config saved to /var/cache/conftool/dbconfig/20220310-091807-marostegui.json [09:18:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:19:58] (03CR) 10Elukey: [C: 03+2] Set bullseye + overlayfs settings for kubernetes2009 [puppet] - 10https://gerrit.wikimedia.org/r/769616 (https://phabricator.wikimedia.org/T300744) (owner: 10Elukey) [09:20:26] 10SRE, 10SRE-Access-Requests: Request Administrator Access to Google Search Console - 
https://phabricator.wikimedia.org/T302625 (10ayounsi) I added Suman to those 3 domains. Please let me know if it works as expected or if there are any issues. I agree a different task would keep things cleaner. Thanks! [09:21:48] 10SRE, 10SRE-Access-Requests: Request Administrator Access to Google Search Console - https://phabricator.wikimedia.org/T302625 (10ayounsi) a:03SCherukuwada [09:22:56] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes2009.codfw.wmnet with OS bullseye [09:22:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:35] (03PS1) 10Giuseppe Lavagetto: external_clouds_vendors: Add ability to save data to conftool too. [puppet] - 10https://gerrit.wikimedia.org/r/769660 [09:24:37] (03PS1) 10Giuseppe Lavagetto: puppetmaster: download public clouds, upload to etcd [puppet] - 10https://gerrit.wikimedia.org/r/769661 [09:25:18] (03CR) 10jerkins-bot: [V: 04-1] external_clouds_vendors: Add ability to save data to conftool too. 
[puppet] - 10https://gerrit.wikimedia.org/r/769660 (owner: 10Giuseppe Lavagetto) [09:26:06] (03CR) 10JMeybohm: [C: 03+1] Set bullseye + overlayfs settings for kubernetes2010 [puppet] - 10https://gerrit.wikimedia.org/r/769617 (https://phabricator.wikimedia.org/T300744) (owner: 10Elukey) [09:26:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T298294)', diff saved to https://phabricator.wikimedia.org/P22274 and previous config saved to /var/cache/conftool/dbconfig/20220310-092610-marostegui.json [09:26:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:26:14] T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294 [09:27:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T300775)', diff saved to https://phabricator.wikimedia.org/P22275 and previous config saved to /var/cache/conftool/dbconfig/20220310-092735-marostegui.json [09:27:36] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1177.eqiad.wmnet with reason: Maintenance [09:27:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:38] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1177.eqiad.wmnet with reason: Maintenance [09:27:39] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775 [09:27:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1177 (T300775)', diff saved to https://phabricator.wikimedia.org/P22276 and previous config saved to /var/cache/conftool/dbconfig/20220310-092742-marostegui.json [09:27:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:46] 10SRE, 
10SRE-Access-Requests: Bing Webmaster Tools access request for Andrew Green - https://phabricator.wikimedia.org/T298723 (10ayounsi) 05In progress→03Stalled [09:27:59] 10SRE, 10SRE-Access-Requests: Bing Webmaster Tools access request for Andrew Green - https://phabricator.wikimedia.org/T298723 (10ayounsi) [09:28:01] 10SRE: Domain Ownership Verification on Various Search Properties - https://phabricator.wikimedia.org/T302617 (10ayounsi) [09:28:38] (03PS2) 10Giuseppe Lavagetto: external_clouds_vendors: Add ability to save data to conftool too. [puppet] - 10https://gerrit.wikimedia.org/r/769660 [09:28:40] (03PS2) 10Giuseppe Lavagetto: puppetmaster: download public clouds, upload to etcd [puppet] - 10https://gerrit.wikimedia.org/r/769661 [09:29:02] PROBLEM - ganeti-wconfd running on ganeti6001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 115 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [09:29:19] (03CR) 10jerkins-bot: [V: 04-1] external_clouds_vendors: Add ability to save data to conftool too. 
[puppet] - 10https://gerrit.wikimedia.org/r/769660 (owner: 10Giuseppe Lavagetto) [09:29:26] (KubernetesCalicoDown) firing: kubernetes2009.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [09:30:02] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:30:02] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:30:55] (03PS2) 10Jbond: external_cloud_vendors: some addtional follow up fixes [puppet] - 10https://gerrit.wikimedia.org/r/769572 [09:31:04] (03CR) 10Jbond: "thanks updated" [puppet] - 10https://gerrit.wikimedia.org/r/769572 (owner: 10Jbond) [09:32:32] (03PS3) 10Giuseppe Lavagetto: external_clouds_vendors: Add ability to save data to conftool too. [puppet] - 10https://gerrit.wikimedia.org/r/769660 [09:32:34] (03PS3) 10Giuseppe Lavagetto: puppetmaster: download public clouds, upload to etcd [puppet] - 10https://gerrit.wikimedia.org/r/769661 [09:33:13] (03CR) 10jerkins-bot: [V: 04-1] external_clouds_vendors: Add ability to save data to conftool too. [puppet] - 10https://gerrit.wikimedia.org/r/769660 (owner: 10Giuseppe Lavagetto) [09:34:54] 10SRE, 10SRE-Access-Requests: Bing Webmaster Tools access request for Andrew Green - https://phabricator.wikimedia.org/T298723 (10ayounsi) I see in T302617#7756075 that Bing got completed. @SCherukuwada, could you take care of granting access to @AndyRussG for this one of request? Going forward I don't know w... 
[09:35:06] 10SRE, 10SRE-Access-Requests: Bing Webmaster Tools access request for Andrew Green - https://phabricator.wikimedia.org/T298723 (10ayounsi) a:03SCherukuwada [09:36:51] (03CR) 10Jbond: [C: 03+1] "LGTM minus the ci issue" [puppet] - 10https://gerrit.wikimedia.org/r/769660 (owner: 10Giuseppe Lavagetto) [09:38:09] 10SRE, 10SRE-Access-Requests: Bing Webmaster Tools access request for Andrew Green - https://phabricator.wikimedia.org/T298723 (10jcrespo) @ayounsi A proper process has to be setup for this, which is not yet in place, it is being worked now, as you said, at T302617. [09:38:27] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2009.codfw.wmnet with reason: host reimage [09:38:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:08] (03CR) 10Jbond: "LGTM, the open question can also be hendled in a later patch set" [puppet] - 10https://gerrit.wikimedia.org/r/769661 (owner: 10Giuseppe Lavagetto) [09:40:56] (JobUnavailable) firing: Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [09:40:58] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2009.codfw.wmnet with reason: host reimage [09:40:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:41:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P22277 and previous config saved to /var/cache/conftool/dbconfig/20220310-094115-marostegui.json [09:41:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:44:26] (KubernetesCalicoDown) resolved: kubernetes2009.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - 
https://alerts.wikimedia.org [09:53:23] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes2009.codfw.wmnet with OS bullseye [09:53:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P22278 and previous config saved to /var/cache/conftool/dbconfig/20220310-095620-marostegui.json [09:56:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:16] 10SRE, 10User-Ladsgroup, 10Wikimedia-Incident: upstream connect error or disconnect/reset before headers. reset reason: overflow - https://phabricator.wikimedia.org/T301505 (10akosiaris) We had a occurrence of this a couple of hours ago, will post more details soon. [10:00:45] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 135, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:04:20] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti6001.drmrs.wmnet [10:04:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:51] (03CR) 10Elukey: [C: 03+2] Set bullseye + overlayfs settings for kubernetes2010 [puppet] - 10https://gerrit.wikimedia.org/r/769617 (https://phabricator.wikimedia.org/T300744) (owner: 10Elukey) [10:04:59] (03PS2) 10Elukey: Set bullseye + overlayfs settings for kubernetes2010 [puppet] - 10https://gerrit.wikimedia.org/r/769617 (https://phabricator.wikimedia.org/T300744) [10:05:08] 10SRE, 10SRE-Access-Requests, 10Infrastructure-Foundations, 10netbox: Grant cn=nda some sort of read only access to Netbox - https://phabricator.wikimedia.org/T302870 (10ayounsi) 05Open→03Stalled I suggest we first upgrade Netbox to 3.1 (or most likely 3.2 by the time T296452) that will allow us to cle... 
[10:05:38] 10SRE, 10SRE-Access-Requests, 10Infrastructure-Foundations, 10netbox: Grant cn=nda some sort of read only access to Netbox - https://phabricator.wikimedia.org/T302870 (10ayounsi) [10:07:08] 10SRE, 10Infrastructure-Foundations, 10netbox: Grant cn=nda some sort of read only access to Netbox - https://phabricator.wikimedia.org/T302870 (10ayounsi) -sre-access-requests for now, to clear the Clinic Duty dashboard. [10:07:33] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 102, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:08:23] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "This change makes sense to me, but please collect +1 from other engineer that knows more than I do about the PXE setup to make sure we're " [homer/public] - 10https://gerrit.wikimedia.org/r/769657 (owner: 10Ayounsi) [10:08:45] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti6001.drmrs.wmnet [10:08:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:09:58] 10SRE, 10SRE-Access-Requests, 10SRE-OnFire, 10WMF-Legal: Grant Zabe access to the T302047 gdoc incident report - https://phabricator.wikimedia.org/T302163 (10ayounsi) 05Open→03Stalled a:03Zabe @Zabe please let us know when you get any feedback, we will do the same. 
[10:10:28] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes2010.codfw.wmnet with OS bullseye [10:10:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:11:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T298294)', diff saved to https://phabricator.wikimedia.org/P22279 and previous config saved to /var/cache/conftool/dbconfig/20220310-101125-marostegui.json [10:11:27] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1184.eqiad.wmnet with reason: Maintenance [10:11:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:11:28] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1184.eqiad.wmnet with reason: Maintenance [10:11:29] T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294 [10:11:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:11:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:11:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1184 (T298294)', diff saved to https://phabricator.wikimedia.org/P22280 and previous config saved to /var/cache/conftool/dbconfig/20220310-101133-marostegui.json [10:11:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:27] (03CR) 10DCausse: elastic: relax & restore perms during upgrade (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/769109 (https://phabricator.wikimedia.org/T301955) (owner: 10Ryan Kemper) [10:15:39] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:15:53] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Active - kubernetes-codfw, 
AS64602/IPv4: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:17:21] 10SRE-OnFire, 10DBA, 10Sustainability (Incident Followup): Investigate if stopping mysql with buffer_pool dump between 10.4 versions is safe - https://phabricator.wikimedia.org/T303498 (10Volans) [10:17:26] (KubernetesCalicoDown) firing: kubernetes2010.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [10:17:27] 10SRE, 10Data-Catalog, 10Data-Engineering, 10serviceops, 10Service-deployment-requests: New Service Request: DataHub - https://phabricator.wikimedia.org/T303049 (10akosiaris) >>! In T303049#7753274, @BTullis wrote: > How can I tell what the source IP address(es) of my services will be, as seen by the bac... [10:17:29] (03CR) 10Jbond: O:external_clouds_vendors: New module for fetching cloud networks (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/769410 (https://phabricator.wikimedia.org/T270391) (owner: 10Jbond) [10:17:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T298294)', diff saved to https://phabricator.wikimedia.org/P22281 and previous config saved to /var/cache/conftool/dbconfig/20220310-101738-marostegui.json [10:17:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:43] T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294 [10:17:55] (03CR) 10Gehel: [C: 03+1] "Looks good in principle" [deployment-charts] - 10https://gerrit.wikimedia.org/r/769075 (https://phabricator.wikimedia.org/T302494) (owner: 10DCausse) [10:18:22] 10SRE-OnFire, 10DBA, 10Performance-Team, 10Wikimedia-Rdbms, and 2 others: 2022-03-10 Mediawiki availability afected due to a database query processing slowdown affecting most of the rest of the database infrastructure - https://phabricator.wikimedia.org/T303499 (10Volans) [10:20:07] (03PS1) 
10Alexandros Kosiaris: Merge rdb2008 and rdb2010 site.pp stanzas [puppet] - 10https://gerrit.wikimedia.org/r/769665 [10:20:09] (03PS1) 10Alexandros Kosiaris: ferm: SERVICES_KUBEPODS_NETWORKS to WIKIKUBE_KUBEPODS_NETWORKS [puppet] - 10https://gerrit.wikimedia.org/r/769666 [10:20:53] 10SRE, 10Data-Catalog, 10Data-Engineering, 10serviceops, 10Service-deployment-requests: New Service Request: DataHub - https://phabricator.wikimedia.org/T303049 (10BTullis) Great. Thanks both. I'm now working through the first set of comments left by @JMeybohm on the patch, trying to make it use the scaf... [10:21:16] (03CR) 10Giuseppe Lavagetto: puppetmaster: download public clouds, upload to etcd (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/769661 (owner: 10Giuseppe Lavagetto) [10:21:46] (03PS4) 10Giuseppe Lavagetto: external_clouds_vendors: Add ability to save data to conftool too. [puppet] - 10https://gerrit.wikimedia.org/r/769660 [10:21:48] (03PS4) 10Giuseppe Lavagetto: puppetmaster: download public clouds, upload to etcd [puppet] - 10https://gerrit.wikimedia.org/r/769661 [10:21:50] (03PS1) 10Giuseppe Lavagetto: C:varnish: add the external cloud vendors file to the cache clusters [puppet] - 10https://gerrit.wikimedia.org/r/769667 (https://phabricator.wikimedia.org/T270391) [10:23:45] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti6004.drmrs.wmnet [10:23:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:10] (03CR) 10jerkins-bot: [V: 04-1] C:varnish: add the external cloud vendors file to the cache clusters [puppet] - 10https://gerrit.wikimedia.org/r/769667 (https://phabricator.wikimedia.org/T270391) (owner: 10Giuseppe Lavagetto) [10:26:00] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2010.codfw.wmnet with reason: host reimage [10:26:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:26:29] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 
10https://gerrit.wikimedia.org/r/769572 (owner: 10Jbond) [10:27:41] (KubernetesCalicoDown) resolved: kubernetes2010.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [10:27:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T300775)', diff saved to https://phabricator.wikimedia.org/P22282 and previous config saved to /var/cache/conftool/dbconfig/20220310-102757-marostegui.json [10:28:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:01] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775 [10:28:12] (03CR) 10Jbond: [C: 03+1] external_clouds_vendors: Add ability to save data to conftool too. [puppet] - 10https://gerrit.wikimedia.org/r/769660 (owner: 10Giuseppe Lavagetto) [10:29:02] (03CR) 10Jbond: [C: 03+1] puppetmaster: download public clouds, upload to etcd (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/769661 (owner: 10Giuseppe Lavagetto) [10:29:13] (03CR) 10Giuseppe Lavagetto: external_clouds_vendors: Add ability to save data to conftool too. 
(032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/769660 (owner: 10Giuseppe Lavagetto) [10:29:20] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti6004.drmrs.wmnet [10:29:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:22] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2010.codfw.wmnet with reason: host reimage [10:29:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:30:03] (03CR) 10Jbond: [C: 04-1] "this will probably get superseeded by the following so -1 for now" [puppet] - 10https://gerrit.wikimedia.org/r/769132 (https://phabricator.wikimedia.org/T270391) (owner: 10Jbond) [10:30:21] 10SRE, 10Data-Catalog, 10Data-Engineering, 10serviceops, 10Service-deployment-requests: New Service Request: DataHub - https://phabricator.wikimedia.org/T303049 (10akosiaris) As far as I am concerned, this service request LGTM. Thanks for the very detailed diagram (including a link to the source), repos... 
[10:30:42] !log failover ganeti master for drmrs/B13 to ganeti6004 [10:30:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:11] (KubernetesCalicoDown) firing: kubernetes2010.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [10:32:00] (03PS2) 10Jbond: C:varnish: add the external cloud vendors file to the cache clusters [puppet] - 10https://gerrit.wikimedia.org/r/769667 (https://phabricator.wikimedia.org/T270391) (owner: 10Giuseppe Lavagetto) [10:32:26] (03CR) 10Jbond: [C: 03+1] "LGTM but may be worth getting some go template expert to check as well" [puppet] - 10https://gerrit.wikimedia.org/r/769667 (https://phabricator.wikimedia.org/T270391) (owner: 10Giuseppe Lavagetto) [10:32:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P22283 and previous config saved to /var/cache/conftool/dbconfig/20220310-103243-marostegui.json [10:32:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:42] 10SRE-swift-storage: Fully rebalance production rings - https://phabricator.wikimedia.org/T303507 (10MatthewVernon) [10:38:39] 10SRE, 10User-Ladsgroup, 10Wikimedia-Incident: upstream connect error or disconnect/reset before headers. reset reason: overflow - https://phabricator.wikimedia.org/T301505 (10akosiaris) >>! In T301505#7766564, @akosiaris wrote: > We had a occurrence of this a couple of hours ago, will post more details soon... 
[10:40:59] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 102, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:41:11] (KubernetesCalicoDown) resolved: kubernetes2010.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [10:41:45] (03CR) 10Jbond: [V: 03+1 C: 03+2] C:pupetmaster: add support for netbox-hiera git repo [puppet] - 10https://gerrit.wikimedia.org/r/769538 (owner: 10Jbond) [10:42:03] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes2010.codfw.wmnet with OS bullseye [10:42:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P22284 and previous config saved to /var/cache/conftool/dbconfig/20220310-104302-marostegui.json [10:43:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:19] !log disable puppet fleet wide [10:44:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:53] 10SRE, 10User-Ladsgroup, 10Wikimedia-Incident: upstream connect error or disconnect/reset before headers. 
reset reason: overflow - https://phabricator.wikimedia.org/T301505 (10akosiaris) p:05High→03Low [10:44:53] !log reboot rdb2009 for upgrades [10:44:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:55] hi all, I just deployed a patch which triggered an apache restart on the puppet masters, as such expect the widespread puppet error to trigger soon [10:45:06] ouch [10:45:10] (03PS1) 10MVernon: codfw-prod: rebalance the rings [software/swift-ring] - 10https://gerrit.wikimedia.org/r/769671 (https://phabricator.wikimedia.org/T303507) [10:45:12] it can be ignored and I'll push to recover it soon [10:45:22] for now disabling puppet to stop the bleeding [10:45:22] ok, noted. Thanks for the heads up [10:45:57] (03CR) 10MVernon: "Hi," [software/swift-ring] - 10https://gerrit.wikimedia.org/r/769671 (https://phabricator.wikimedia.org/T303507) (owner: 10MVernon) [10:47:13] PROBLEM - Host rdb2009 is DOWN: PING CRITICAL - Packet loss = 100% [10:47:15] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti6002.drmrs.wmnet [10:47:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:47:25] RECOVERY - Host rdb2009 is UP: PING OK - Packet loss = 0%, RTA = 33.21 ms [10:47:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P22285 and previous config saved to /var/cache/conftool/dbconfig/20220310-104748-marostegui.json [10:47:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:47:52] 10SRE, 10Data-Catalog, 10Data-Engineering, 10serviceops, 10Service-deployment-requests: New Service Request: DataHub - https://phabricator.wikimedia.org/T303049 (10JMeybohm) I'd like to add the proposal of using Ingress (T290966) for the frontend (to not have to configure LVS for that). For the consumers... 
[10:48:18] !log re-enable puppet fleet wide [10:48:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:31] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 135, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:58:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P22286 and previous config saved to /var/cache/conftool/dbconfig/20220310-105807-marostegui.json [10:58:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:40] (03PS5) 10Giuseppe Lavagetto: external_clouds_vendors: Add ability to save data to conftool too. [puppet] - 10https://gerrit.wikimedia.org/r/769660 [10:58:42] (03PS5) 10Giuseppe Lavagetto: puppetmaster: download public clouds, upload to etcd [puppet] - 10https://gerrit.wikimedia.org/r/769661 [10:58:44] (03PS3) 10Giuseppe Lavagetto: C:varnish: add the external cloud vendors file to the cache clusters [puppet] - 10https://gerrit.wikimedia.org/r/769667 (https://phabricator.wikimedia.org/T270391) [10:59:21] (03CR) 10DCausse: [C: 03+2] flink-session-cluster: add thanos S3 config [deployment-charts] - 10https://gerrit.wikimedia.org/r/769075 (https://phabricator.wikimedia.org/T302494) (owner: 10DCausse) [11:00:05] mvolz: How many deployers does it take to do Services – Citoid / Zotero deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220310T1100). 
[11:00:54] (03PS1) 10Mvolz: Update Zotero to 4c177291e5 [deployment-charts] - 10https://gerrit.wikimedia.org/r/769675 [11:01:46] (03CR) 10DCausse: [C: 04-2] "forgot to update the chart version" [deployment-charts] - 10https://gerrit.wikimedia.org/r/769075 (https://phabricator.wikimedia.org/T302494) (owner: 10DCausse) [11:01:57] (03PS1) 10Volans: alertmanager: do not retry on HTTP 500 response [software/spicerack] - 10https://gerrit.wikimedia.org/r/769676 [11:02:00] (03CR) 10Mvolz: [C: 03+2] Update Zotero to 4c177291e5 [deployment-charts] - 10https://gerrit.wikimedia.org/r/769675 (owner: 10Mvolz) [11:02:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T298294)', diff saved to https://phabricator.wikimedia.org/P22287 and previous config saved to /var/cache/conftool/dbconfig/20220310-110253-marostegui.json [11:02:54] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [11:02:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:56] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [11:02:57] T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294 [11:02:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:08] (03PS3) 10DCausse: flink-session-cluster: add thanos S3 config [deployment-charts] - 10https://gerrit.wikimedia.org/r/769075 (https://phabricator.wikimedia.org/T302494) [11:04:59] !log mvolz@deploy1002 helmfile [staging] START helmfile.d/services/zotero: apply [11:05:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:06] 10SRE, 10Data-Catalog, 10Data-Engineering, 10serviceops, 10Service-deployment-requests: New Service 
Request: DataHub - https://phabricator.wikimedia.org/T303049 (10BTullis) > I'd like to add the proposal of using Ingress (T290966) for the frontend (to not have to configure LVS for that). Sounds good to m... [11:06:08] !log mvolz@deploy1002 helmfile [staging] DONE helmfile.d/services/zotero: apply [11:06:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:13] (03Merged) 10jenkins-bot: Update Zotero to 4c177291e5 [deployment-charts] - 10https://gerrit.wikimedia.org/r/769675 (owner: 10Mvolz) [11:08:02] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2103.codfw.wmnet with reason: Maintenance [11:08:03] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2103.codfw.wmnet with reason: Maintenance [11:08:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:04] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 16:00:00 on 14 hosts with reason: Maintenance [11:08:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:14] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on 14 hosts with reason: Maintenance [11:08:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:19] 10SRE, 10Data-Catalog, 10Data-Engineering, 10serviceops, 10Service-deployment-requests: New Service Request: DataHub - https://phabricator.wikimedia.org/T303049 (10BTullis) > Per my undestanding the service will reside in the wikikube cluster for the MVP phase, despite being a bad fit for it per https://... 
[11:09:01] PROBLEM - ganeti-wconfd running on ganeti6002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 115 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [11:09:01] 10SRE-swift-storage, 10Patch-For-Review: Fully rebalance production rings - https://phabricator.wikimedia.org/T303507 (10krillrivera) [11:09:59] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti6002.drmrs.wmnet [11:10:00] !log jmm@cumin1001 START - Cookbook sre.hosts.reboot-single for host idp-test1001.wikimedia.org [11:10:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:10] !log mvolz@deploy1002 helmfile [staging] START helmfile.d/services/zotero: apply [11:10:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:12] !log mvolz@deploy1002 helmfile [staging] DONE helmfile.d/services/zotero: apply [11:12:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:13] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1106.eqiad.wmnet with reason: Maintenance [11:13:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T300775)', diff saved to https://phabricator.wikimedia.org/P22289 and previous config saved to /var/cache/conftool/dbconfig/20220310-111313-marostegui.json [11:13:14] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1126.eqiad.wmnet with reason: Maintenance [11:13:15] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1106.eqiad.wmnet with reason: Maintenance [11:13:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:16] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 16:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet 
with reason: Maintenance [11:13:16] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1126.eqiad.wmnet with reason: Maintenance [11:13:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:19] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775 [11:13:19] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [11:13:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1126 (T300775)', diff saved to https://phabricator.wikimedia.org/P22290 and previous config saved to /var/cache/conftool/dbconfig/20220310-111320-marostegui.json [11:13:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1106 (T298294)', diff saved to https://phabricator.wikimedia.org/P22291 and previous config saved to /var/cache/conftool/dbconfig/20220310-111330-marostegui.json [11:13:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:34] T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294 [11:14:24] !log jmm@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host idp-test1001.wikimedia.org [11:14:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:34] (03PS1) 10Marostegui: 
db1132: Install MariaDB 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/769677 (https://phabricator.wikimedia.org/T303395) [11:15:15] (03CR) 10Marostegui: [C: 03+2] db1132: Install MariaDB 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/769677 (https://phabricator.wikimedia.org/T303395) (owner: 10Marostegui) [11:16:21] !log jmm@cumin1001 START - Cookbook sre.hosts.reboot-single for host elastic1093.eqiad.wmnet [11:16:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1126 (T300775)', diff saved to https://phabricator.wikimedia.org/P22292 and previous config saved to /var/cache/conftool/dbconfig/20220310-111705-marostegui.json [11:17:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:52] (03PS6) 10Giuseppe Lavagetto: puppetmaster: download public clouds, upload to etcd [puppet] - 10https://gerrit.wikimedia.org/r/769661 [11:17:54] (03PS4) 10Giuseppe Lavagetto: C:varnish: add the external cloud vendors file to the cache clusters [puppet] - 10https://gerrit.wikimedia.org/r/769667 (https://phabricator.wikimedia.org/T270391) [11:18:31] !log rolled out python3-wmflib v1.1.2 to the entire fleet (buster+ only) [11:18:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:27] (03CR) 10Giuseppe Lavagetto: [C: 03+2] external_clouds_vendors: Add ability to save data to conftool too. 
[puppet] - 10https://gerrit.wikimedia.org/r/769660 (owner: 10Giuseppe Lavagetto) [11:19:44] (03PS1) 10Mvolz: Revert "Update Zotero to 4c177291e5" [deployment-charts] - 10https://gerrit.wikimedia.org/r/769556 [11:19:57] (03CR) 10Mvolz: [C: 03+2] Revert "Update Zotero to 4c177291e5" [deployment-charts] - 10https://gerrit.wikimedia.org/r/769556 (owner: 10Mvolz) [11:20:33] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host dumpsdata1007.eqiad.wmnet [11:20:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:02] (03CR) 10Jbond: [C: 03+1] puppetmaster: download public clouds, upload to etcd [puppet] - 10https://gerrit.wikimedia.org/r/769661 (owner: 10Giuseppe Lavagetto) [11:23:31] (03PS1) 10Giuseppe Lavagetto: external_clouds_vendors: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/769678 [11:23:52] (03Merged) 10jenkins-bot: Revert "Update Zotero to 4c177291e5" [deployment-charts] - 10https://gerrit.wikimedia.org/r/769556 (owner: 10Mvolz) [11:24:24] !log mvolz@deploy1002 helmfile [staging] START helmfile.d/services/zotero: apply [11:24:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:37] !log mvolz@deploy1002 helmfile [staging] DONE helmfile.d/services/zotero: apply [11:24:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:45] (03CR) 10Jbond: [C: 03+1] external_clouds_vendors: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/769678 (owner: 10Giuseppe Lavagetto) [11:24:58] !log mvolz@deploy1002 helmfile [codfw] START helmfile.d/services/zotero: apply [11:24:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:07] !log jmm@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host elastic1093.eqiad.wmnet [11:25:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host 
dumpsdata1007.eqiad.wmnet [11:25:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:44] !log mvolz@deploy1002 helmfile [codfw] DONE helmfile.d/services/zotero: apply [11:25:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:17] !log mvolz@deploy1002 helmfile [eqiad] START helmfile.d/services/zotero: apply [11:26:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:44] (03PS2) 10Ladsgroup: db1141: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/768651 (https://phabricator.wikimedia.org/T302950) (owner: 10Gerrit maintenance bot) [11:26:48] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] db1141: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/768651 (https://phabricator.wikimedia.org/T302950) (owner: 10Gerrit maintenance bot) [11:26:50] !log mvolz@deploy1002 helmfile [eqiad] DONE helmfile.d/services/zotero: apply [11:26:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:00] 10SRE-swift-storage, 10Patch-For-Review: Fully rebalance production rings - https://phabricator.wikimedia.org/T303507 (10krillrivera) [11:29:12] !log ebysans@deploy1002 Started deploy [airflow-dags/analytics@b681376]: (no justification provided) [11:29:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:19] !log ebysans@deploy1002 Finished deploy [airflow-dags/analytics@b681376]: (no justification provided) (duration: 00m 07s) [11:29:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:24] (03PS2) 10Giuseppe Lavagetto: external_clouds_vendors: fix small errors [puppet] - 10https://gerrit.wikimedia.org/r/769678 [11:29:49] (03CR) 10Giuseppe Lavagetto: [C: 03+2] external_clouds_vendors: fix small errors [puppet] - 10https://gerrit.wikimedia.org/r/769678 (owner: 10Giuseppe Lavagetto) [11:30:23] (03CR) 10Jbond: [C: 03+1] external_clouds_vendors: fix small errors [puppet] - 
10https://gerrit.wikimedia.org/r/769678 (owner: 10Giuseppe Lavagetto) [11:31:18] (03CR) 10Mvolz: "Any idea how to debug this issue? The PR works locally, but when I deployed this to staging, requests to the staging server would time out" [deployment-charts] - 10https://gerrit.wikimedia.org/r/769556 (owner: 10Mvolz) [11:32:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1126', diff saved to https://phabricator.wikimedia.org/P22293 and previous config saved to /var/cache/conftool/dbconfig/20220310-113210-marostegui.json [11:32:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:29] (03PS1) 10Giuseppe Lavagetto: conftool-data: add ranges for the cloud ipblocks [puppet] - 10https://gerrit.wikimedia.org/r/769679 [11:34:24] (03CR) 10Giuseppe Lavagetto: [C: 03+2] conftool-data: add ranges for the cloud ipblocks [puppet] - 10https://gerrit.wikimedia.org/r/769679 (owner: 10Giuseppe Lavagetto) [11:34:52] (03PS3) 10Jbond: external_cloud_vendors: some addtional follow up fixes [puppet] - 10https://gerrit.wikimedia.org/r/769572 [11:35:56] (03PS1) 10Marostegui: change_old_flags_T298563.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/769680 (https://phabricator.wikimedia.org/T298563) [11:36:32] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1141.eqiad.wmnet with reason: Maintenance [11:36:33] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1141.eqiad.wmnet with reason: Maintenance [11:36:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1141 (T302950)', diff saved to https://phabricator.wikimedia.org/P22294 and previous config saved to /var/cache/conftool/dbconfig/20220310-113638-ladsgroup.json 
[11:36:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:42] T302950: Upgrade s4 to bullseye - https://phabricator.wikimedia.org/T302950 [11:38:24] (03PS4) 10Jbond: external_cloud_vendors: some addtional follow up fixes [puppet] - 10https://gerrit.wikimedia.org/r/769572 [11:38:46] (03CR) 10Jbond: [C: 03+2] external_cloud_vendors: some addtional follow up fixes [puppet] - 10https://gerrit.wikimedia.org/r/769572 (owner: 10Jbond) [11:38:55] (03CR) 10Jbond: [V: 03+2 C: 03+2] external_cloud_vendors: some addtional follow up fixes [puppet] - 10https://gerrit.wikimedia.org/r/769572 (owner: 10Jbond) [11:42:23] (03PS7) 10Jbond: puppetmaster: download public clouds, upload to etcd [puppet] - 10https://gerrit.wikimedia.org/r/769661 (owner: 10Giuseppe Lavagetto) [11:43:11] (03CR) 10Ladsgroup: [C: 03+1] change_old_flags_T298563.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/769680 (https://phabricator.wikimedia.org/T298563) (owner: 10Marostegui) [11:43:54] (03CR) 10Marostegui: [C: 03+2] change_old_flags_T298563.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/769680 (https://phabricator.wikimedia.org/T298563) (owner: 10Marostegui) [11:44:17] (03Merged) 10jenkins-bot: change_old_flags_T298563.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/769680 (https://phabricator.wikimedia.org/T298563) (owner: 10Marostegui) [11:45:38] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.reimage for host db1141.eqiad.wmnet with OS bullseye [11:45:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1126', diff saved to https://phabricator.wikimedia.org/P22296 and previous config saved to /var/cache/conftool/dbconfig/20220310-114715-marostegui.json [11:47:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[11:50:15] (03PS8) 10Giuseppe Lavagetto: puppetmaster: download public clouds, upload to etcd [puppet] - 10https://gerrit.wikimedia.org/r/769661 [11:53:07] 10SRE, 10Data-Catalog, 10Data-Engineering, 10serviceops, 10Service-deployment-requests: New Service Request: DataHub - https://phabricator.wikimedia.org/T303049 (10JMeybohm) >>! In T303049#7766702, @BTullis wrote: >> I'd like to add the proposal of using Ingress (T290966) for the frontend (to not have to... [11:53:12] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on 7 hosts with reason: Reboots [11:53:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:16] 10SRE-swift-storage, 10Patch-For-Review: Fully rebalance production rings - https://phabricator.wikimedia.org/T303507 (10MatthewVernon) [11:53:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Reboots [11:53:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:30] (03CR) 10Jcrespo: [C: 03+1] wmnet: Failover m1-master [dns] - 10https://gerrit.wikimedia.org/r/769654 (https://phabricator.wikimedia.org/T303174) (owner: 10Marostegui) [11:57:03] (03CR) 10Marostegui: [C: 03+2] wmnet: Failover m1-master [dns] - 10https://gerrit.wikimedia.org/r/769654 (https://phabricator.wikimedia.org/T303174) (owner: 10Marostegui) [11:57:25] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1141.eqiad.wmnet with reason: host reimage [11:57:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:57:27] (03PS9) 10Giuseppe Lavagetto: puppetmaster: download public clouds, upload to etcd [puppet] - 10https://gerrit.wikimedia.org/r/769661 [11:58:26] !log Failover m1 master [11:58:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:53] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1141.eqiad.wmnet with 
reason: host reimage [12:00:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:01:20] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 71 probes of 668 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [12:02:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1126 (T300775)', diff saved to https://phabricator.wikimedia.org/P22297 and previous config saved to /var/cache/conftool/dbconfig/20220310-120221-marostegui.json [12:02:22] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1114.eqiad.wmnet with reason: Maintenance [12:02:24] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1114.eqiad.wmnet with reason: Maintenance [12:02:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:27] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775 [12:02:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1114 (T300775)', diff saved to https://phabricator.wikimedia.org/P22298 and previous config saved to /var/cache/conftool/dbconfig/20220310-120228-marostegui.json [12:02:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:03:20] RECOVERY - SSH on db2090.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:06:16] (03PS10) 10Giuseppe Lavagetto: puppetmaster: download public clouds, upload to etcd [puppet] - 10https://gerrit.wikimedia.org/r/769661 [12:06:30] RECOVERY - IPv6 ping to eqsin on 
ripe-atlas-eqsin IPv6 is OK: OK - failed 59 probes of 668 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [12:08:05] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34187/console" [puppet] - 10https://gerrit.wikimedia.org/r/769661 (owner: 10Giuseppe Lavagetto) [12:09:02] (03CR) 10Jbond: [C: 03+1] "LGTM optional nit" [puppet] - 10https://gerrit.wikimedia.org/r/769661 (owner: 10Giuseppe Lavagetto) [12:09:30] 10SRE, 10Data-Catalog, 10Data-Engineering, 10serviceops, 10Service-deployment-requests: New Service Request: DataHub - https://phabricator.wikimedia.org/T303049 (10BTullis) > Sorry, totally my fault! I meant the GMS, not consumer. From what you wrote in T301454#7741876 it sounds like you just don't want... [12:10:17] (03PS11) 10Giuseppe Lavagetto: puppetmaster: download public clouds, upload to etcd [puppet] - 10https://gerrit.wikimedia.org/r/769661 [12:10:42] (03PS13) 10Jbond: sre.puppet.sync-netbox-hiera: Cookbook for syncing netbox puppet data [cookbooks] - 10https://gerrit.wikimedia.org/r/739234 (https://phabricator.wikimedia.org/T229397) [12:11:22] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34188/console" [puppet] - 10https://gerrit.wikimedia.org/r/769661 (owner: 10Giuseppe Lavagetto) [12:12:11] (03CR) 10Giuseppe Lavagetto: [V: 03+1 C: 03+2] puppetmaster: download public clouds, upload to etcd [puppet] - 10https://gerrit.wikimedia.org/r/769661 (owner: 10Giuseppe Lavagetto) [12:13:35] (03CR) 10Giuseppe Lavagetto: [V: 03+1 C: 03+2] puppetmaster: download public clouds, upload to etcd (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/769661 (owner: 10Giuseppe Lavagetto) [12:13:45] !log marostegui@cumin1001 dbctl commit (dc=all): 
'Repooling after maintenance db1106 (T298294)', diff saved to https://phabricator.wikimedia.org/P22299 and previous config saved to /var/cache/conftool/dbconfig/20220310-121344-marostegui.json [12:13:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:49] T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294 [12:14:19] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on 7 hosts with reason: Reboots [12:14:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Reboots [12:14:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:49] 10SRE, 10Data-Catalog, 10Data-Engineering, 10serviceops, 10Service-deployment-requests: New Service Request: DataHub - https://phabricator.wikimedia.org/T303049 (10JMeybohm) >>! In T303049#7766821, @BTullis wrote: > Yes, that's right. Great! >>! In T303049#7766821, @BTullis wrote: > So I'll change the `... 
[12:15:44] (03PS12) 10Giuseppe Lavagetto: puppetmaster: download public clouds, upload to etcd [puppet] - 10https://gerrit.wikimedia.org/r/769661 [12:16:34] (03PS13) 10Jbond: puppetmaster: download public clouds, upload to etcd [puppet] - 10https://gerrit.wikimedia.org/r/769661 (owner: 10Giuseppe Lavagetto) [12:16:40] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1141.eqiad.wmnet with OS bullseye [12:16:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:30] (03CR) 10Jbond: [C: 03+1] "LGTM i made a minor edit the command should be run with -vv i think one v got dropped between a rebase (could have been one of my rebases)" [puppet] - 10https://gerrit.wikimedia.org/r/769661 (owner: 10Giuseppe Lavagetto) [12:17:35] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34189/console" [puppet] - 10https://gerrit.wikimedia.org/r/769661 (owner: 10Giuseppe Lavagetto) [12:19:27] (03PS14) 10Jbond: sre.puppet.sync-netbox-hiera: Cookbook for syncing netbox puppet data [cookbooks] - 10https://gerrit.wikimedia.org/r/739234 (https://phabricator.wikimedia.org/T229397) [12:19:43] (03CR) 10Giuseppe Lavagetto: [C: 03+2] puppetmaster: download public clouds, upload to etcd [puppet] - 10https://gerrit.wikimedia.org/r/769661 (owner: 10Giuseppe Lavagetto) [12:20:27] (03CR) 10Btullis: Add helm charts and a helmfile configuration for datahub (0313 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454) (owner: 10Btullis) [12:26:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141 (T302950)', diff saved to https://phabricator.wikimedia.org/P22300 and previous config saved to /var/cache/conftool/dbconfig/20220310-122659-ladsgroup.json [12:27:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:03] T302950: 
Upgrade s4 to bullseye - https://phabricator.wikimedia.org/T302950 [12:28:47] PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 102 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [12:28:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P22301 and previous config saved to /var/cache/conftool/dbconfig/20220310-122850-marostegui.json [12:28:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:17] RECOVERY - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is OK: (C)100 gt (W)50 gt 5 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [12:30:25] 10Puppet, 10Infrastructure-Foundations, 10observability, 10cloud-services-team (Kanban): 2 systemctl services failing on cloudcontrol hosts: prometheus-openstack-exporter and logrotate - https://phabricator.wikimedia.org/T303511 (10jcrespo) [12:30:30] 10SRE, 10SRE-Access-Requests: Bing Webmaster Tools access request for Andrew Green - https://phabricator.wikimedia.org/T298723 (10SCherukuwada) Update: @AndyRussG already has access to Bing, as of yesterday. I'm working on a process to follow if/when more people request access. 
[12:31:31] 10SRE, 10SRE-Access-Requests: Bing Webmaster Tools access request for Andrew Green - https://phabricator.wikimedia.org/T298723 (10SCherukuwada) 05Stalled→03Resolved [12:31:36] 10Puppet, 10Infrastructure-Foundations, 10observability, 10cloud-services-team (Kanban): 2 systemctl services failing on cloudcontrol hosts: prometheus-openstack-exporter and logrotate - https://phabricator.wikimedia.org/T303511 (10jcrespo) Feel free to merge if this is the same as the other ticket I refer... [12:39:38] (03PS24) 10Btullis: Add helm charts and a helmfile configuration for datahub [deployment-charts] - 10https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454) [12:41:58] 10SRE, 10Data-Catalog, 10Data-Engineering, 10serviceops, 10Service-deployment-requests: New Service Request: DataHub - https://phabricator.wikimedia.org/T303049 (10BTullis) > Actually I was just referring to the diagram, as is mentions specific ports and I wanted to make sure that's not a fixed requirem... 
[12:42:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P22302 and previous config saved to /var/cache/conftool/dbconfig/20220310-124204-ladsgroup.json [12:42:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P22303 and previous config saved to /var/cache/conftool/dbconfig/20220310-124355-marostegui.json [12:43:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:39] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf group / analytics-admins / analytics-privatedata-users for NOkafor - https://phabricator.wikimedia.org/T303512 (10NOkafor-WMF) [12:46:39] 10SRE: Update Documentation and Process for Access to Search Consoles - https://phabricator.wikimedia.org/T303513 (10SCherukuwada) [12:46:48] (03PS15) 10Jbond: sre.puppet.sync-netbox-hiera: Cookbook for syncing netbox puppet data [cookbooks] - 10https://gerrit.wikimedia.org/r/739234 (https://phabricator.wikimedia.org/T229397) [12:49:13] (03CR) 10jerkins-bot: [V: 04-1] sre.puppet.sync-netbox-hiera: Cookbook for syncing netbox puppet data [cookbooks] - 10https://gerrit.wikimedia.org/r/739234 (https://phabricator.wikimedia.org/T229397) (owner: 10Jbond) [12:50:21] (03PS16) 10Jbond: sre.puppet.sync-netbox-hiera: Cookbook for syncing netbox puppet data [cookbooks] - 10https://gerrit.wikimedia.org/r/739234 (https://phabricator.wikimedia.org/T229397) [12:51:41] 10SRE: Update Documentation and Process for Access to Search Consoles - https://phabricator.wikimedia.org/T303513 (10SCherukuwada) [12:51:44] 10SRE, 10SRE-Access-Requests: Request Administrator Access to Google Search Console - https://phabricator.wikimedia.org/T302625 (10SCherukuwada) [12:51:46] 10SRE: Domain Ownership Verification on Various Search Properties - 
https://phabricator.wikimedia.org/T302617 (10SCherukuwada) [12:53:03] (03CR) 10jerkins-bot: [V: 04-1] sre.puppet.sync-netbox-hiera: Cookbook for syncing netbox puppet data [cookbooks] - 10https://gerrit.wikimedia.org/r/739234 (https://phabricator.wikimedia.org/T229397) (owner: 10Jbond) [12:53:04] 10SRE: Update Documentation and Process for Access to Search Consoles - https://phabricator.wikimedia.org/T303513 (10SCherukuwada) [12:53:23] 10SRE, 10SRE-Access-Requests: Request Administrator Access to Google Search Console - https://phabricator.wikimedia.org/T302625 (10SCherukuwada) 05Open→03Resolved Works as expected, thank you. [12:57:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P22304 and previous config saved to /var/cache/conftool/dbconfig/20220310-125709-ladsgroup.json [12:57:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T298294)', diff saved to https://phabricator.wikimedia.org/P22305 and previous config saved to /var/cache/conftool/dbconfig/20220310-125901-marostegui.json [12:59:02] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1119.eqiad.wmnet with reason: Maintenance [12:59:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:04] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1119.eqiad.wmnet with reason: Maintenance [12:59:05] T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294 [12:59:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1119 (T298294)', diff saved to 
https://phabricator.wikimedia.org/P22306 and previous config saved to /var/cache/conftool/dbconfig/20220310-125909-marostegui.json [12:59:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1114 (T300775)', diff saved to https://phabricator.wikimedia.org/P22307 and previous config saved to /var/cache/conftool/dbconfig/20220310-130243-marostegui.json [13:02:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:47] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775 [13:05:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119 (T298294)', diff saved to https://phabricator.wikimedia.org/P22308 and previous config saved to /var/cache/conftool/dbconfig/20220310-130523-marostegui.json [13:05:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:27] T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294 [13:10:36] 10SRE-Access-Requests: Requesting access to RESOURCE for NOkafor - https://phabricator.wikimedia.org/T303516 (10NOkafor-WMF) [13:11:09] 10SRE-Access-Requests: Requesting access to DataEngineering Team Resources for NOkafor - https://phabricator.wikimedia.org/T303516 (10NOkafor-WMF) [13:11:40] 10SRE, 10SRE-Access-Requests: Requesting access to DataEngineering Team Resources for NOkafor - https://phabricator.wikimedia.org/T303516 (10RhinosF1) [13:12:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141 (T302950)', diff saved to https://phabricator.wikimedia.org/P22309 and previous config saved to /var/cache/conftool/dbconfig/20220310-131214-ladsgroup.json [13:12:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:19] T302950: Upgrade s4 to bullseye - 
https://phabricator.wikimedia.org/T302950 [13:12:37] 10SRE, 10SRE-Access-Requests: Requesting access to DataEngineering Team Resources for NOkafor - https://phabricator.wikimedia.org/T303516 (10NOkafor-WMF) [13:13:29] (03CR) 10DCausse: [C: 03+2] flink-session-cluster: add thanos S3 config [deployment-charts] - 10https://gerrit.wikimedia.org/r/769075 (https://phabricator.wikimedia.org/T302494) (owner: 10DCausse) [13:16:49] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) firing: Processing latency of WDQS_Streaming_Updater in eqiad (k8s) is above 5 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org [13:17:17] (03Merged) 10jenkins-bot: flink-session-cluster: add thanos S3 config [deployment-charts] - 10https://gerrit.wikimedia.org/r/769075 (https://phabricator.wikimedia.org/T302494) (owner: 10DCausse) [13:17:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1114', diff saved to https://phabricator.wikimedia.org/P22310 and previous config saved to /var/cache/conftool/dbconfig/20220310-131748-marostegui.json [13:17:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:44] (03CR) 10Tchanders: [C: 03+1] beta: Include mediawiki.ipinfo_interaction in $wgEventLoggingStreamNames [mediawiki-config] - 10https://gerrit.wikimedia.org/r/769484 (owner: 10Phuedx) [13:20:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119', diff saved to https://phabricator.wikimedia.org/P22311 and previous config saved to /var/cache/conftool/dbconfig/20220310-132029-marostegui.json [13:20:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:58] !log dcausse@deploy1002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [13:20:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[13:22:23] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1121.eqiad.wmnet with reason: Maintenance [13:22:25] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1121.eqiad.wmnet with reason: Maintenance [13:22:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:26] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [13:22:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:29] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [13:22:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1121 (T302950)', diff saved to https://phabricator.wikimedia.org/P22313 and previous config saved to /var/cache/conftool/dbconfig/20220310-132234-ladsgroup.json [13:22:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:37] T302950: Upgrade s4 to bullseye - https://phabricator.wikimedia.org/T302950 [13:22:44] !log dcausse@deploy1002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [13:22:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:02] (03PS17) 10Jbond: sre.puppet.sync-netbox-hiera: Cookbook for syncing netbox puppet data [cookbooks] - 10https://gerrit.wikimedia.org/r/739234 (https://phabricator.wikimedia.org/T229397) [13:25:21] (03CR) 10Cathal Mooney: [C: 03+2] New function and changes to wmf-netbox plugin to support EVPN config. 
(034 comments) [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/760566 (https://phabricator.wikimedia.org/T299758) (owner: 10Cathal Mooney) [13:25:50] (03CR) 10Cathal Mooney: [V: 03+2 C: 03+2] New function and changes to wmf-netbox plugin to support EVPN config. [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/760566 (https://phabricator.wikimedia.org/T299758) (owner: 10Cathal Mooney) [13:26:27] (03CR) 10jerkins-bot: [V: 04-1] sre.puppet.sync-netbox-hiera: Cookbook for syncing netbox puppet data [cookbooks] - 10https://gerrit.wikimedia.org/r/739234 (https://phabricator.wikimedia.org/T229397) (owner: 10Jbond) [13:26:54] !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=ores,name=codfw [13:26:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:22] !log depool ores in codfw from discovery records to initiate reboot of rdb2007 [13:27:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:57] (03PS1) 10Jbond: P:puppet_compiler: add netbox checkout to puppet compilers [puppet] - 10https://gerrit.wikimedia.org/r/769696 [13:32:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1114', diff saved to https://phabricator.wikimedia.org/P22314 and previous config saved to /var/cache/conftool/dbconfig/20220310-133254-marostegui.json [13:32:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119', diff saved to https://phabricator.wikimedia.org/P22315 and previous config saved to /var/cache/conftool/dbconfig/20220310-133534-marostegui.json [13:35:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:56] (JobUnavailable) firing: Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [13:41:52] 10SRE-OnFire, 10DBA, 10Performance-Team, 10Wikimedia-Rdbms, and 2 others: 2022-03-10 MediaWiki availability afected due to a database query processing slowdown affecting most of the rest of the database infrastructure - https://phabricator.wikimedia.org/T303499 (10Reedy) [13:43:27] !log reboot rdb2007 for upgrades [13:43:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1114 (T300775)', diff saved to https://phabricator.wikimedia.org/P22316 and previous config saved to /var/cache/conftool/dbconfig/20220310-134759-marostegui.json [13:48:01] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1111.eqiad.wmnet with reason: Maintenance [13:48:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:03] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1111.eqiad.wmnet with reason: Maintenance [13:48:04] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775 [13:48:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1111 (T300775)', diff saved to https://phabricator.wikimedia.org/P22317 and previous config saved to /var/cache/conftool/dbconfig/20220310-134807-marostegui.json [13:48:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119 (T298294)', diff saved to https://phabricator.wikimedia.org/P22318 and previous config saved to /var/cache/conftool/dbconfig/20220310-135039-marostegui.json [13:50:41] !log 
marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1164.eqiad.wmnet with reason: Maintenance [13:50:42] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1164.eqiad.wmnet with reason: Maintenance [13:50:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:43] T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294 [13:50:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1164 (T298294)', diff saved to https://phabricator.wikimedia.org/P22319 and previous config saved to /var/cache/conftool/dbconfig/20220310-135047-marostegui.json [13:50:47] !log akosiaris@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=ores,name=codfw [13:50:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:04] !log repool ores in codfw in discovery records [13:51:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:39] PROBLEM - Check systemd state on ores2005 is CRITICAL: CRITICAL - degraded: The following units failed: celery-ores-worker.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:55:40] !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=ores,name=eqiad [13:55:42] !log depool ores in eqiad from discovery records to initiate reboot of rdb1011 [13:55:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1164 (T298294)', diff saved 
to https://phabricator.wikimedia.org/P22320 and previous config saved to /var/cache/conftool/dbconfig/20220310-135659-marostegui.json [13:57:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:03] T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294 [13:57:23] RECOVERY - Check systemd state on ores2005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:59:50] (03PS5) 10Giuseppe Lavagetto: C:varnish: add the external cloud vendors file to the cache clusters [puppet] - 10https://gerrit.wikimedia.org/r/769667 (https://phabricator.wikimedia.org/T270391) [14:00:04] RoanKattouw, Lucas_WMDE, and Urbanecm: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220310T1400). [14:00:04] Tchanders: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:14] I can deploy today [14:00:25] urbanecm: Hi! [14:00:29] hi Tchanders! [14:02:07] Tchanders: looks good to me, but I'm wondering why it's made for beta enwiki only (and afaics, the only stream that does that). [14:02:49] PROBLEM - Host rdb1011 is DOWN: PING CRITICAL - Packet loss = 100% [14:03:09] urbanecm: We're scoping it small for now after accidentally overriding some other streams last week. 
Our QAs are happy with this for now [14:03:21] fair enough :) [14:03:26] (PS2) Urbanecm: beta: Include mediawiki.ipinfo_interaction in $wgEventLoggingStreamNames [mediawiki-config] - https://gerrit.wikimedia.org/r/769484 (owner: Phuedx) [14:03:30] (CR) Urbanecm: [C: +2] beta: Include mediawiki.ipinfo_interaction in $wgEventLoggingStreamNames [mediawiki-config] - https://gerrit.wikimedia.org/r/769484 (owner: Phuedx) [14:03:45] Tchanders: should be deployed to beta within ~30 minutes [14:03:49] anything else? [14:04:17] (Merged) jenkins-bot: beta: Include mediawiki.ipinfo_interaction in $wgEventLoggingStreamNames [mediawiki-config] - https://gerrit.wikimedia.org/r/769484 (owner: Phuedx) [14:04:17] RECOVERY - Host rdb1011 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [14:04:29] urbanecm: Thanks! Nothing else from me [14:04:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [14:05:29] okay :) [14:06:48] !log UTC afternoon B&C done [14:06:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:20] !log repool ores in eqiad in discovery records [14:08:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:27] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [14:08:29] !log akosiaris@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=ores,name=eqiad [14:08:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:33] (PS1) DCausse: flink-session-cluster: fix swift API key for s3 [deployment-charts] - https://gerrit.wikimedia.org/r/769699 (https://phabricator.wikimedia.org/T302494) [14:08:40] (CR) Giuseppe 
Lavagetto: [C: +2] C:varnish: add the external cloud vendors file to the cache clusters [puppet] - https://gerrit.wikimedia.org/r/769667 (https://phabricator.wikimedia.org/T270391) (owner: Giuseppe Lavagetto) [14:08:42] (CR) jerkins-bot: [V: -1] flink-session-cluster: fix swift API key for s3 [deployment-charts] - https://gerrit.wikimedia.org/r/769699 (https://phabricator.wikimedia.org/T302494) (owner: DCausse) [14:09:03] (PS2) DCausse: flink-session-cluster: fix swift API key for s3 [deployment-charts] - https://gerrit.wikimedia.org/r/769699 (https://phabricator.wikimedia.org/T302494) [14:09:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [14:09:45] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [14:09:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:55] (LogstashKafkaConsumerLag) resolved: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org [14:12:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1164', diff saved to https://phabricator.wikimedia.org/P22321 and previous config saved to /var/cache/conftool/dbconfig/20220310-141204-marostegui.json [14:12:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:57] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [14:13:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:56] (CR) Gehel: [C: +1] "LGTM" [deployment-charts] - https://gerrit.wikimedia.org/r/769699 (https://phabricator.wikimedia.org/T302494) (owner: DCausse) [14:20:11] (CR) DCausse: [C: +2] flink-session-cluster: fix swift API key for s3 [deployment-charts] - 
https://gerrit.wikimedia.org/r/769699 (https://phabricator.wikimedia.org/T302494) (owner: DCausse) [14:21:08] (CR) Giuseppe Lavagetto: [C: +1] Add missing Build-Depends entry [software/httpbb] - https://gerrit.wikimedia.org/r/761442 (owner: RLazarus) [14:22:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121 (T302950)', diff saved to https://phabricator.wikimedia.org/P22322 and previous config saved to /var/cache/conftool/dbconfig/20220310-142248-ladsgroup.json [14:22:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:53] T302950: Upgrade s4 to bullseye - https://phabricator.wikimedia.org/T302950 [14:23:45] (Merged) jenkins-bot: flink-session-cluster: fix swift API key for s3 [deployment-charts] - https://gerrit.wikimedia.org/r/769699 (https://phabricator.wikimedia.org/T302494) (owner: DCausse) [14:25:30] (WdqsStreamingUpdaterFlinkJobNotRunning) firing: WDQS_Streaming_Updater in codfw (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org [14:25:31] !log dcausse@deploy1002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [14:25:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:23] !log dcausse@deploy1002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [14:26:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1164', diff saved to https://phabricator.wikimedia.org/P22323 and previous config saved to /var/cache/conftool/dbconfig/20220310-142709-marostegui.json [14:27:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:10] (CR) Alexandros Kosiaris: [C: +2] Merge rdb2008 and rdb2010 site.pp stanzas 
[puppet] - https://gerrit.wikimedia.org/r/769665 (owner: Alexandros Kosiaris) [14:30:31] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1017.eqiad.wmnet with OS bullseye [14:30:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:37] (CR) Giuseppe Lavagetto: [C: +1] C:varnish: Load public-clouds.json via netmapper [puppet] - https://gerrit.wikimedia.org/r/769464 (https://phabricator.wikimedia.org/T270391) (owner: Jbond) [14:30:40] (CR) Alexandros Kosiaris: [C: +2] ferm: SERVICES_KUBEPODS_NETWORKS to WIKIKUBE_KUBEPODS_NETWORKS [puppet] - https://gerrit.wikimedia.org/r/769666 (owner: Alexandros Kosiaris) [14:30:41] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudvirt1047.eqiad.wmnet - https://phabricator.wikimedia.org/T293391 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1017.eqiad.wmnet with OS b... [14:35:18] (CR) Btullis: [C: +1] "Looks good to me." [puppet] - https://gerrit.wikimedia.org/r/769587 (https://phabricator.wikimedia.org/T295405) (owner: Eevans) [14:35:47] (CR) Giuseppe Lavagetto: [C: -1] "Overall LGTM; see the two small suggestions." 
[puppet] - https://gerrit.wikimedia.org/r/769511 (owner: Jbond) [14:37:06] (PS1) Hnowlan: cassandra: add stub cred for image_suggestions [labs/private] - https://gerrit.wikimedia.org/r/769706 (https://phabricator.wikimedia.org/T295405) [14:37:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121', diff saved to https://phabricator.wikimedia.org/P22324 and previous config saved to /var/cache/conftool/dbconfig/20220310-143753-ladsgroup.json [14:37:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:12] PROBLEM - SSH on dumpsdata1002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:38:13] (CR) Giuseppe Lavagetto: "LGTM, see my comment linked to the followup patch." [puppet] - https://gerrit.wikimedia.org/r/769466 (owner: Jbond) [14:41:34] (CR) Btullis: [C: +1] cassandra: add stub cred for image_suggestions [labs/private] - https://gerrit.wikimedia.org/r/769706 (https://phabricator.wikimedia.org/T295405) (owner: Hnowlan) [14:41:50] (CR) Hnowlan: [V: +2 C: +2] cassandra: add stub cred for image_suggestions [labs/private] - https://gerrit.wikimedia.org/r/769706 (https://phabricator.wikimedia.org/T295405) (owner: Hnowlan) [14:41:56] !log jhathaway@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on mirror1001.wikimedia.org with reason: new kernel [14:41:58] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on mirror1001.wikimedia.org with reason: new kernel [14:41:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1164 (T298294)', diff saved to https://phabricator.wikimedia.org/P22325 and previous config saved to 
/var/cache/conftool/dbconfig/20220310-144214-marostegui.json [14:42:16] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1169.eqiad.wmnet with reason: Maintenance [14:42:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:17] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1169.eqiad.wmnet with reason: Maintenance [14:42:18] T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294 [14:42:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1169 (T298294)', diff saved to https://phabricator.wikimedia.org/P22326 and previous config saved to /var/cache/conftool/dbconfig/20220310-144222-marostegui.json [14:42:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:59] (CR) Hnowlan: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34190/console" [puppet] - https://gerrit.wikimedia.org/r/769587 (https://phabricator.wikimedia.org/T295405) (owner: Eevans) [14:43:22] (PS1) Marostegui: wmnet: Switchover m2-master [dns] - https://gerrit.wikimedia.org/r/769708 [14:43:57] SRE, SRE-OnFire (FY2021/2022-Q3), Infrastructure-Foundations, SRE Observability (FY2021/2022-Q3): Implement an accurate and easy to understand status page for all wikis - https://phabricator.wikimedia.org/T202061 (akosiaris) >>! In T202061#7760129, @CDanis wrote: > And while we haven't yet had... 
[14:44:49] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1017.eqiad.wmnet with OS bullseye [14:44:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:54] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudvirt1047.eqiad.wmnet - https://phabricator.wikimedia.org/T293391 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1017.eqiad.wmnet with OS bulls... [14:49:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1111 (T300775)', diff saved to https://phabricator.wikimedia.org/P22327 and previous config saved to /var/cache/conftool/dbconfig/20220310-144900-marostegui.json [14:49:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:06] PROBLEM - restbase endpoints health on restbase2025 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:49:06] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775 [14:49:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T298294)', diff saved to https://phabricator.wikimedia.org/P22328 and previous config saved to /var/cache/conftool/dbconfig/20220310-144911-marostegui.json [14:49:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:15] T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294 [14:50:24] (Abandoned) Jbond: C:varnish: Add the external_cloud_vendors module to the cache clusters [puppet] - https://gerrit.wikimedia.org/r/769132 
(https://phabricator.wikimedia.org/T270391) (owner: Jbond) [14:50:26] (CR) Vgutierrez: [C: -1] "Overall LGTM, please check the inline comment" [puppet] - https://gerrit.wikimedia.org/r/769511 (owner: Jbond) [14:50:32] (PS7) Jbond: C:varnish: Load public-clouds.json via netmapper [puppet] - https://gerrit.wikimedia.org/r/769464 (https://phabricator.wikimedia.org/T270391) [14:50:41] (PS6) Jbond: C:varnish: update templates netmapper public clouds [puppet] - https://gerrit.wikimedia.org/r/769466 [14:50:48] (PS3) Jbond: C:varnish: use X-Public-Cloud to store the cloud provider [puppet] - https://gerrit.wikimedia.org/r/769511 [14:50:54] (PS7) Jbond: C:varnish: create rate limit keyed on the cloud provider [puppet] - https://gerrit.wikimedia.org/r/769469 (https://phabricator.wikimedia.org/T270391) [14:52:47] PROBLEM - puppet last run on cp6011 is CRITICAL: CRITICAL: Puppet has been disabled for 604977 seconds, message: bblack - filippo, last run 7 days ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:52:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121', diff saved to https://phabricator.wikimedia.org/P22329 and previous config saved to /var/cache/conftool/dbconfig/20220310-145258-ladsgroup.json [14:53:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:13] PROBLEM - Check systemd state on kubernetes1005 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:54:28] !log dcausse@deploy1002 helmfile [codfw] START helmfile.d/services/rdf-streaming-updater: apply [14:54:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:04] (PS8) Jbond: C:varnish: Load public-clouds.json via netmapper [puppet] - https://gerrit.wikimedia.org/r/769464 (https://phabricator.wikimedia.org/T270391) [14:55:06] (PS4) 
Jbond: C:varnish: use X-Public-Cloud to store the cloud provider [puppet] - https://gerrit.wikimedia.org/r/769511 [14:55:33] PROBLEM - Check whether ferm is active by checking the default input chain on thanos-fe2001 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:55:51] !log dcausse@deploy1002 helmfile [codfw] DONE helmfile.d/services/rdf-streaming-updater: apply [14:55:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:25] RECOVERY - restbase endpoints health on restbase2025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:57:35] SRE, SRE-OnFire (FY2021/2022-Q3), Infrastructure-Foundations, SRE Observability (FY2021/2022-Q3): Implement an accurate and easy to understand status page for all wikis - https://phabricator.wikimedia.org/T202061 (CDanis) >>! In T202061#7766411, @Abbe98 wrote: > Is there any public documentation... 
[14:58:39] (PS1) Cwhite: hiera: add pki to logging env [puppet] - https://gerrit.wikimedia.org/r/769711 (https://phabricator.wikimedia.org/T300130) [15:02:20] (PS2) Cwhite: hiera: add pki to logging env [puppet] - https://gerrit.wikimedia.org/r/769711 (https://phabricator.wikimedia.org/T300130) [15:03:27] (PS3) Cwhite: hiera: add pki to logging env [puppet] - https://gerrit.wikimedia.org/r/769711 (https://phabricator.wikimedia.org/T300130) [15:04:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1111', diff saved to https://phabricator.wikimedia.org/P22330 and previous config saved to /var/cache/conftool/dbconfig/20220310-150405-marostegui.json [15:04:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P22331 and previous config saved to /var/cache/conftool/dbconfig/20220310-150417-marostegui.json [15:04:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:47] PROBLEM - Check systemd state on thanos-fe2001 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:07:52] (CR) Jbond: C:varnish: use X-Public-Cloud to store the cloud provider (4 comments) [puppet] - https://gerrit.wikimedia.org/r/769511 (owner: Jbond) [15:08:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121 (T302950)', diff saved to https://phabricator.wikimedia.org/P22332 and previous config saved to /var/cache/conftool/dbconfig/20220310-150803-ladsgroup.json [15:08:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:08] T302950: Upgrade s4 to bullseye - https://phabricator.wikimedia.org/T302950 [15:08:13] (CR) Muehlenhoff: "PCC: 
https://puppet-compiler.wmflabs.org/pcc-worker1002/34191/" [puppet] - https://gerrit.wikimedia.org/r/769426 (owner: Muehlenhoff) [15:08:15] (CR) Muehlenhoff: [C: +2] role::cluster_management: Remove support for buster [puppet] - https://gerrit.wikimedia.org/r/769426 (owner: Muehlenhoff) [15:08:28] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1121.eqiad.wmnet with reason: Maintenance [15:08:30] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1121.eqiad.wmnet with reason: Maintenance [15:08:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:31] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [15:08:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:34] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [15:08:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1121 (T302950)', diff saved to https://phabricator.wikimedia.org/P22333 and previous config saved to /var/cache/conftool/dbconfig/20220310-150839-ladsgroup.json [15:08:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:51] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1005 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [15:11:58] (PS1) Muehlenhoff: Remove cumin2001 from Puppet [puppet] - 
https://gerrit.wikimedia.org/r/769712 (https://phabricator.wikimedia.org/T303399) [15:12:29] (CR) jerkins-bot: [V: -1] Remove cumin2001 from Puppet [puppet] - https://gerrit.wikimedia.org/r/769712 (https://phabricator.wikimedia.org/T303399) (owner: Muehlenhoff) [15:13:45] RECOVERY - Check systemd state on kubernetes1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:15:24] (PS2) Muehlenhoff: Remove cumin2001 from Puppet [puppet] - https://gerrit.wikimedia.org/r/769712 (https://phabricator.wikimedia.org/T303399) [15:16:33] RECOVERY - Check systemd state on thanos-fe2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:19:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1111', diff saved to https://phabricator.wikimedia.org/P22334 and previous config saved to /var/cache/conftool/dbconfig/20220310-151910-marostegui.json [15:19:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P22335 and previous config saved to /var/cache/conftool/dbconfig/20220310-151923-marostegui.json [15:19:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:26] !log installing expat security updates on stretch [15:21:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:52] (CR) Volans: "Looks good, small nits, and minor changes as discussed offline" [cookbooks] - https://gerrit.wikimedia.org/r/739234 (https://phabricator.wikimedia.org/T229397) (owner: Jbond) [15:27:11] RECOVERY - Check whether ferm is active by checking the default input chain on thanos-fe2001 is OK: OK ferm input default policy is set 
https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [15:27:57] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.reimage for host db1121.eqiad.wmnet with OS bullseye [15:27:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:23] !log upload certspotter 0.10-1wm1 to apt.wm.o - T204993 [15:33:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:27] T204993: Update certspotter - https://phabricator.wikimedia.org/T204993 [15:33:59] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [15:34:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1111 (T300775)', diff saved to https://phabricator.wikimedia.org/P22336 and previous config saved to /var/cache/conftool/dbconfig/20220310-153416-marostegui.json [15:34:18] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1104.eqiad.wmnet with reason: Maintenance [15:34:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:20] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1104.eqiad.wmnet with reason: Maintenance [15:34:20] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775 [15:34:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1104 (T300775)', diff saved to https://phabricator.wikimedia.org/P22337 and previous config saved to /var/cache/conftool/dbconfig/20220310-153424-marostegui.json [15:34:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:29] !log 
marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T298294)', diff saved to https://phabricator.wikimedia.org/P22338 and previous config saved to /var/cache/conftool/dbconfig/20220310-153428-marostegui.json [15:34:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:32] T298294: Make primary key filearchive.fa_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T298294 [15:35:49] (CR) Ssingh: [V: +1 C: +2] certspotter: update package and replace cron with systemd timer [puppet] - https://gerrit.wikimedia.org/r/768065 (https://phabricator.wikimedia.org/T204993) (owner: Ssingh) [15:36:28] hnowlan: ok to merge your change? :) [15:36:36] (yes for mine, if you are doing it) [15:36:38] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1121.eqiad.wmnet with reason: host reimage [15:36:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:49] !log rolling restart of thumbor to pick up expat security updates [15:37:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:38] sukhe: oops, I'll merge - thanks [15:39:10] thanks [15:39:20] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1121.eqiad.wmnet with reason: host reimage [15:39:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:31] done [15:40:19] (CR) Jbond: [C: +1] "LGTM" [software/spicerack] - https://gerrit.wikimedia.org/r/769676 (owner: Volans) [15:40:58] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1005 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [15:41:47] (PS8) Jbond: C:varnish: create rate limit keyed on the cloud provider [puppet] - https://gerrit.wikimedia.org/r/769469 (https://phabricator.wikimedia.org/T270391) [15:42:06] PROBLEM - Check 
systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: nginx.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:42:35] (CR) Volans: [C: +2] alertmanager: do not retry on HTTP 500 response [software/spicerack] - https://gerrit.wikimedia.org/r/769676 (owner: Volans) [15:44:22] (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/769711 (https://phabricator.wikimedia.org/T300130) (owner: Cwhite) [15:45:46] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 66 probes of 668 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:47:46] !log ayounsi@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1017.eqiad.wmnet with OS bullseye [15:47:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:35] (Merged) jenkins-bot: alertmanager: do not retry on HTTP 500 response [software/spicerack] - https://gerrit.wikimedia.org/r/769676 (owner: Volans) [15:49:12] (CR) Jbond: [C: +2] P:puppet_compiler: add netbox checkout to puppet compilers [puppet] - https://gerrit.wikimedia.org/r/769696 (owner: Jbond) [15:51:12] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 56 probes of 668 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:53:12] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1121.eqiad.wmnet with OS bullseye [15:53:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:24] (PS1) Volans: CHANGELOG: add changelogs for release v2.3.1 [software/spicerack] - https://gerrit.wikimedia.org/r/769717 [15:56:22] 
!log ayounsi@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1017.eqiad.wmnet with OS bullseye [15:56:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:31] (CR) Volans: [C: +2] CHANGELOG: add changelogs for release v2.3.1 [software/spicerack] - https://gerrit.wikimedia.org/r/769717 (owner: Volans) [15:57:14] RECOVERY - SSH on dumpsdata1002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:57:30] !log ayounsi@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1017.eqiad.wmnet with OS bullseye [15:57:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:44] (PS1) Muehlenhoff: Enable profile::auto_restarts::service for apache/doc [puppet] - https://gerrit.wikimedia.org/r/769718 (https://phabricator.wikimedia.org/T135991) [16:03:02] (Merged) jenkins-bot: CHANGELOG: add changelogs for release v2.3.1 [software/spicerack] - https://gerrit.wikimedia.org/r/769717 (owner: Volans) [16:04:46] (PS1) Volans: Upstream release v2.3.1 [software/spicerack] (debian) - https://gerrit.wikimedia.org/r/769719 [16:04:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121 (T302950)', diff saved to https://phabricator.wikimedia.org/P22339 and previous config saved to /var/cache/conftool/dbconfig/20220310-160457-ladsgroup.json [16:05:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:05] T302950: Upgrade s4 to bullseye - https://phabricator.wikimedia.org/T302950 [16:07:26] (PS1) Ladsgroup: auto_schema: Add abaility to skip replicas [software] - https://gerrit.wikimedia.org/r/769720 (https://phabricator.wikimedia.org/T301779) [16:07:51] (CR) Ladsgroup: "Haven't tested it yet." 
[software] - https://gerrit.wikimedia.org/r/769720 (https://phabricator.wikimedia.org/T301779) (owner: Ladsgroup) [16:08:10] PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.131 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:08:16] (CR) Volans: [C: +2] Upstream release v2.3.1 [software/spicerack] (debian) - https://gerrit.wikimedia.org/r/769719 (owner: Volans) [16:08:42] XioNoX: wut? ^^^ eqsin cr3 BGP [16:09:44] I can ssh [16:11:39] yeah it's back to normal on the icinga UI [16:11:59] ack, transient I guess [16:14:08] (Merged) jenkins-bot: Upstream release v2.3.1 [software/spicerack] (debian) - https://gerrit.wikimedia.org/r/769719 (owner: Volans) [16:14:22] (PS1) Muehlenhoff: Add new Cumin alias parsoid-testing [puppet] - https://gerrit.wikimedia.org/r/769722 [16:15:10] (CR) Giuseppe Lavagetto: [C: -1] C:varnish: use X-Public-Cloud to store the cloud provider (1 comment) [puppet] - https://gerrit.wikimedia.org/r/769511 (owner: Jbond) [16:16:35] (CR) Muehlenhoff: [C: +2] Add new Cumin alias parsoid-testing [puppet] - https://gerrit.wikimedia.org/r/769722 (owner: Muehlenhoff) [16:17:03] (CR) Giuseppe Lavagetto: [C: -1] C:varnish: use X-Public-Cloud to store the cloud provider (1 comment) [puppet] - https://gerrit.wikimedia.org/r/769511 (owner: Jbond) [16:20:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121', diff saved to https://phabricator.wikimedia.org/P22340 and previous config saved to /var/cache/conftool/dbconfig/20220310-162004-ladsgroup.json [16:20:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:12] !log uploaded spicerack_2.3.1 to apt.wikimedia.org buster-wikimedia,bullseye-wikimedia [16:21:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:53] (PS5) Jbond: C:varnish: use X-Public-Cloud to store the cloud 
provider [puppet] - https://gerrit.wikimedia.org/r/769511 [16:21:59] (CR) Jbond: "update thanks" [puppet] - https://gerrit.wikimedia.org/r/769511 (owner: Jbond) [16:23:24] SRE, ops-eqiad, DC-Ops, Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (Jclark-ctr) a:Jclark-ctr→Cmjohnson [16:24:20] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34194/console" [puppet] - https://gerrit.wikimedia.org/r/769511 (owner: Jbond) [16:25:42] (PS1) Jbond: puppet_compiler: fix netbox checkout path [puppet] - https://gerrit.wikimedia.org/r/769724 [16:25:58] (CR) Jbond: [V: +2 C: +2] puppet_compiler: fix netbox checkout path [puppet] - https://gerrit.wikimedia.org/r/769724 (owner: Jbond) [16:26:08] (PS1) Muehlenhoff: Enable profile::auto_restarts::service for parsoid::testing [puppet] - https://gerrit.wikimedia.org/r/769725 (https://phabricator.wikimedia.org/T135991) [16:26:43] (CR) jerkins-bot: [V: -1] Enable profile::auto_restarts::service for parsoid::testing [puppet] - https://gerrit.wikimedia.org/r/769725 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff) [16:27:24] (PS3) Volans: sre.hosts.downtime: conver to use the new alerting [cookbooks] - https://gerrit.wikimedia.org/r/769067 (https://phabricator.wikimedia.org/T293209) [16:27:30] (PS4) Volans: sre.hosts.downtime: conver to use the new alerting [cookbooks] - https://gerrit.wikimedia.org/r/769067 (https://phabricator.wikimedia.org/T293209) [16:30:41] (PS2) Muehlenhoff: Enable profile::auto_restarts::service for parsoid::testing [puppet] - https://gerrit.wikimedia.org/r/769725 (https://phabricator.wikimedia.org/T135991) [16:30:44] !log depool doh1002 for testing eBPF [16:30:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:49] (PS1) Cathal 
Mooney: Modify wmf-netbox plugin to provide QFX5120-48Y port block speeds [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/769729 (https://phabricator.wikimedia.org/T303529) [16:31:18] (03CR) 10jerkins-bot: [V: 04-1] Enable profile::auto_restarts::service for parsoid::testing [puppet] - 10https://gerrit.wikimedia.org/r/769725 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [16:33:47] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on doh1002.wikimedia.org with reason: testing eBPF filtering [16:33:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:50] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on doh1002.wikimedia.org with reason: testing eBPF filtering [16:33:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:24] (03CR) 10Cathal Mooney: [C: 03+2] Initial changes to Homer config and templates for EVPN switches Eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/769478 (https://phabricator.wikimedia.org/T299758) (owner: 10Cathal Mooney) [16:34:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1104 (T300775)', diff saved to https://phabricator.wikimedia.org/P22341 and previous config saved to /var/cache/conftool/dbconfig/20220310-163438-marostegui.json [16:34:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:42] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775 [16:34:52] (03CR) 10Cathal Mooney: [V: 03+2 C: 03+2] Initial changes to Homer config and templates for EVPN switches Eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/769478 (https://phabricator.wikimedia.org/T299758) (owner: 10Cathal Mooney) [16:35:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121', diff saved to https://phabricator.wikimedia.org/P22342 and previous config saved to 
/var/cache/conftool/dbconfig/20220310-163509-ladsgroup.json [16:35:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:46] (03PS1) 10Muehlenhoff: Add cumin aliases for ml-etcd [puppet] - 10https://gerrit.wikimedia.org/r/769730 [16:37:21] (03PS3) 10Muehlenhoff: Enable profile::auto_restarts::service for parsoid::testing [puppet] - 10https://gerrit.wikimedia.org/r/769725 (https://phabricator.wikimedia.org/T135991) [16:38:19] (03CR) 10jerkins-bot: [V: 04-1] Enable profile::auto_restarts::service for parsoid::testing [puppet] - 10https://gerrit.wikimedia.org/r/769725 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [16:38:26] PROBLEM - BFD status on cr2-eqiad is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:38:46] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:41:00] (03PS1) 10Muehlenhoff: Add Cumin alias for mediabackups [puppet] - 10https://gerrit.wikimedia.org/r/769731 [16:41:05] (03CR) 10Volans: "comments/questions/ideas inline" [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/769729 (https://phabricator.wikimedia.org/T303529) (owner: 10Cathal Mooney) [16:41:32] (03CR) 10Volans: [C: 03+2] sre.hosts.downtime: conver to use the new alerting [cookbooks] - 10https://gerrit.wikimedia.org/r/769067 (https://phabricator.wikimedia.org/T293209) (owner: 10Volans) [16:44:15] (03Merged) 10jenkins-bot: sre.hosts.downtime: conver to use the new alerting [cookbooks] - 10https://gerrit.wikimedia.org/r/769067 (https://phabricator.wikimedia.org/T293209) (owner: 10Volans) [16:45:47] !log volans@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:05:00 on cumin2002.codfw.wmnet,cumin1001.eqiad.wmnet with reason: Testing alertmanager downtime [16:45:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:26] (03CR) 10Muehlenhoff: gitlab_runner: 
add dedicated service unit file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/769065 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [16:49:40] !log volans@cumin2002 START - Cookbook sre.hosts.downtime for 0:05:00 on D{cumin1001.mgmt} with reason: Testing alertmanager downtime [16:49:41] !log volans@cumin2002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 0:05:00 on D{cumin1001.mgmt} with reason: Testing alertmanager downtime [16:49:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1104', diff saved to https://phabricator.wikimedia.org/P22343 and previous config saved to /var/cache/conftool/dbconfig/20220310-164943-marostegui.json [16:49:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:03] !log volans@cumin2002 START - Cookbook sre.hosts.downtime for 0:05:00 on cumin1001.mgmt with reason: Testing alertmanager downtime [16:50:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:07] !log volans@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:05:00 on cumin1001.mgmt with reason: Testing alertmanager downtime [16:50:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121 (T302950)', diff saved to https://phabricator.wikimedia.org/P22344 and previous config saved to /var/cache/conftool/dbconfig/20220310-165014-ladsgroup.json [16:50:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:19] T302950: Upgrade s4 to bullseye - https://phabricator.wikimedia.org/T302950 [16:50:30] (JobUnavailable) firing: (2) Reduced availability for job jmx_wdqs_updater in eqiad - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [16:51:21] (03CR) 10Cathal Mooney: "Really appreciate the feedback/suggestions Volans thanks :)" [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/769729 (https://phabricator.wikimedia.org/T303529) (owner: 10Cathal Mooney) [16:54:34] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudcephosd10[25-34] - https://phabricator.wikimedia.org/T294972 (10Jclark-ctr) name rack Unit Port CableID Port CableID cloudcephosd1025 e4 21u 21 20220102 ; 20 20220105 cloudcephosd1026 e4 22u 22 202201... [16:55:10] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudcephosd10[25-34] - https://phabricator.wikimedia.org/T294972 (10Jclark-ctr) a:05Jclark-ctr→03Cmjohnson [16:56:05] (03CR) 10Volans: Modify wmf-netbox plugin to provide QFX5120-48Y port block speeds (031 comment) [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/769729 (https://phabricator.wikimedia.org/T303529) (owner: 10Cathal Mooney) [16:56:38] (03PS2) 10Volans: sre.SREBatchRunnerBase: use alerting_hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/769437 [16:56:46] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (10Jclark-ctr) @nskaggs Would we be able to Rack these in New Wmcs Dedicated racks? E4 , F4? 
[16:57:01] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (10Jclark-ctr) [16:58:35] (03PS2) 10Cathal Mooney: Modify wmf-netbox plugin to provide QFX5120-48Y port block speeds [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/769729 (https://phabricator.wikimedia.org/T303529) [16:58:49] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Dale_Zhou - https://phabricator.wikimedia.org/T303031 (10MGerlach) @JMeybohm just wanted to check whether there is any other information needed that I might have missed. I am unsure since I still see the tag "awaiting user input... [16:59:01] (03CR) 10Vgutierrez: C:varnish: Load public-clouds.json via netmapper (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/769464 (https://phabricator.wikimedia.org/T270391) (owner: 10Jbond) [17:00:04] jbond and rzl: I, the Bot under the Fountain, call upon thee, The Deployer, to do Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220310T1700). [17:00:04] zabe: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [17:00:31] (03CR) 10Cathal Mooney: "Updated based on suggestions thanks. 
I've left the 'print' statement there for now until we decide if we want to do anything in that scen" [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/769729 (https://phabricator.wikimedia.org/T303529) (owner: 10Cathal Mooney) [17:02:11] zabe: 👋 looking [17:02:49] (RdfStreamingUpdaterFlinkJobUnstable) firing: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org [17:04:13] (03CR) 10Jcrespo: "Probably a bit generic- we should either call it mediabackup-worker or keep the name and include backup*00[4-7], which are the mediabackup" [puppet] - 10https://gerrit.wikimedia.org/r/769731 (owner: 10Muehlenhoff) [17:04:42] starting with https://gerrit.wikimedia.org/r/761717 -- I'm disabling puppet on A:mw, will test at mwdebug1001 first then re-enable [17:04:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1104', diff saved to https://phabricator.wikimedia.org/P22345 and previous config saved to /var/cache/conftool/dbconfig/20220310-170448-marostegui.json [17:04:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:05:02] 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Change physical label from copernicum.wikimedia.org to mirror1001.wikimedia.org - https://phabricator.wikimedia.org/T297906 (10jhathaway) [17:06:07] (03CR) 10RLazarus: [C: 03+2] Remove otrs-wiki.wikimedia.org from mediawiki.yaml [puppet] - 10https://gerrit.wikimedia.org/r/761717 (https://phabricator.wikimedia.org/T280400) (owner: 10Zabe) [17:06:54] 10SRE, 10Infrastructure-Foundations: decom sodium - https://phabricator.wikimedia.org/T298727 (10jhathaway) 05Open→03Resolved [17:06:57] 10SRE, 10Infrastructure-Foundations: Setup new mirror server (mirror1001.wikimedia.org) - https://phabricator.wikimedia.org/T286898 (10jhathaway) [17:07:49] 
(RdfStreamingUpdaterFlinkJobUnstable) resolved: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org [17:08:13] (03CR) 10Volans: [C: 03+2] sre.SREBatchRunnerBase: use alerting_hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/769437 (owner: 10Volans) [17:08:58] zabe: at mwdebug1001, please test if you're around :) [17:09:05] (03Abandoned) 10Volans: Emergency depool of eqsin [dns] - 10https://gerrit.wikimedia.org/r/769086 (owner: 10Volans) [17:09:17] checking myself too, will proceed in a few minutes if I don't hear anything [17:09:43] 10SRE, 10SRE-OnFire (FY2021/2022-Q2), 10Sustainability (Incident Followup): Incident: 2021-12-03 mx2001->Gmail delivery issues - https://phabricator.wikimedia.org/T297127 (10jhathaway) [17:09:59] 10SRE, 10Infrastructure-Foundations, 10Mail: mx1001.wikimedia.org mail delivery timeouts - https://phabricator.wikimedia.org/T299107 (10jhathaway) 05Open→03Resolved We are no longer seeing the timeouts after setting the sysctl net.ipv4.tcp_fastopen_blackhole_timeout_sec sysctl to 3600 which restores the... [17:10:21] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:(Need By: TBD) rack/setup/install clouddumps100[12] - https://phabricator.wikimedia.org/T299610 (10RobH) 05Open→03Declined Looks like I filed both T302981 & T299610, and T299610 has less recent details, and wasnt linked into T286588, so... [17:10:47] 10SRE, 10Traffic, 10envoy, 10serviceops, 10Sustainability (Incident Followup): Raw "upstream connect error or disconnect/reset before headers. 
reset reason: overflow" error message shown to users during outage - https://phabricator.wikimedia.org/T287983 (10RLazarus) [17:10:57] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (10RobH) Looks like I filed both T302981 & T299610, and T299610 has less recent details, and wasnt linked into T286588, so declinin... [17:11:05] (03Merged) 10jenkins-bot: sre.SREBatchRunnerBase: use alerting_hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/769437 (owner: 10Volans) [17:12:35] (03CR) 10Ahmon Dancy: [C: 03+1] "Timo, would you like me to deploy this for you?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/761965 (https://phabricator.wikimedia.org/T45956) (owner: 10Krinkle) [17:13:12] (03CR) 10Jbond: C:varnish: Load public-clouds.json via netmapper (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/769464 (https://phabricator.wikimedia.org/T270391) (owner: 10Jbond) [17:15:12] (03PS1) 10Jelto: gitlab_runner: cleanup service unit file [puppet] - 10https://gerrit.wikimedia.org/r/769737 (https://phabricator.wikimedia.org/T295481) [17:16:09] going ahead -- mwdebug1001 looks fine from manual testing, httpbb passes except for one broken test that's unrelated (will follow up on that separately) [17:16:22] puppet re-enabled [17:16:39] (03CR) 10Jelto: gitlab_runner: add dedicated service unit file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/769065 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [17:16:44] (03CR) 10Ahmon Dancy: beta::autoupdater: Remove more obsolete stuff after scap prep auto [puppet] - 10https://gerrit.wikimedia.org/r/753787 (owner: 10Ahmon Dancy) [17:17:12] RECOVERY - BFD status on cr2-eqiad is OK: OK: UP: 20 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:17:30] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 24 AdminDown: 0 Down: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:17:51] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34195/console" [puppet] - 10https://gerrit.wikimedia.org/r/769737 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [17:19:01] 10Puppet, 10Infrastructure-Foundations: Where to Put Community Modules? - https://phabricator.wikimedia.org/T302423 (10jhathaway) a:03jhathaway [17:19:29] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host ms-be1071.mgmt.eqiad.wmnet with reboot policy FORCED [17:19:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:44] (03CR) 10Krinkle: "Sure, that'd be great. I sometimes roll them out inside windows, but any time is fine." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/761965 (https://phabricator.wikimedia.org/T45956) (owner: 10Krinkle) [17:19:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1104 (T300775)', diff saved to https://phabricator.wikimedia.org/P22346 and previous config saved to /var/cache/conftool/dbconfig/20220310-171953-marostegui.json [17:19:55] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1101.eqiad.wmnet with reason: Maintenance [17:19:57] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1101.eqiad.wmnet with reason: Maintenance [17:19:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:57] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775 [17:19:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:20:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:20:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1101:3318 (T300775)', diff saved to https://phabricator.wikimedia.org/P22347 
and previous config saved to /var/cache/conftool/dbconfig/20220310-172001-marostegui.json [17:20:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:20:24] jouncebot nowandnext [17:20:25] For the next 0 hour(s) and 39 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220310T1700) [17:20:25] In 1 hour(s) and 39 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220310T1900) [17:20:30] (JobUnavailable) firing: (2) Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [17:20:47] (03PS2) 10JMeybohm: Make k8s-ingress-wikikube page [puppet] - 10https://gerrit.wikimedia.org/r/767078 (https://phabricator.wikimedia.org/T290966) [17:23:01] _joe_: if you have a moment, can you eyeball https://gerrit.wikimedia.org/r/c/operations/puppet/+/768260 before I merge it in the puppet window? 
looks right to me but I'd like a cross-check from someone more familiar with all the ways wikitech is a special snowflake [17:23:23] <_joe_> rzl: I'm in a meeting sorry [17:23:34] no worries, sorry for not checking [17:23:40] (03CR) 10Reedy: [C: 03+1] wikitech_private: write to wmg* constants [puppet] - 10https://gerrit.wikimedia.org/r/768260 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [17:23:50] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host ms-be1068.mgmt.eqiad.wmnet with reboot policy FORCED [17:23:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:01] (03CR) 10Hnowlan: [V: 03+1 C: 03+2] Create Cassandra role for Image Suggestions dataset [puppet] - 10https://gerrit.wikimedia.org/r/769587 (https://phabricator.wikimedia.org/T295405) (owner: 10Eevans) [17:24:10] ah, if Reedy is happy I'm happy :) [17:24:28] One could argue that because I'm British that I'm never happy... [17:24:32] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host ms-be1069.mgmt.eqiad.wmnet with reboot policy FORCED [17:24:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:57] well, in that case if Reedy isn't disgruntled I'm leaping with joy [17:25:02] given the exchange rate [17:25:19] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host ms-be1070.mgmt.eqiad.wmnet with reboot policy FORCED [17:25:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:34] (03CR) 10RLazarus: [C: 03+2] wikitech_private: write to wmg* constants [puppet] - 10https://gerrit.wikimedia.org/r/768260 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [17:28:04] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host ml-serve1005.mgmt.eqiad.wmnet with reboot policy FORCED [17:28:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:28:45] !log cmjohnson@cumin1001 START - Cookbook 
sre.hosts.provision for host ml-serve1006.mgmt.eqiad.wmnet with reboot policy FORCED [17:28:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:28:55] (03Abandoned) 10JHathaway: mirror - wip [puppet] - 10https://gerrit.wikimedia.org/r/745606 (owner: 10JHathaway) [17:29:23] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host ml-serve1007.mgmt.eqiad.wmnet with reboot policy FORCED [17:29:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:13] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host ml-serve1008.mgmt.eqiad.wmnet with reboot policy FORCED [17:30:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:15] seems good 👍 puppet request window complete [17:30:47] I'm going to deploy a config change. [17:30:53] all yours [17:33:10] (03CR) 10Ahmon Dancy: [C: 03+2] [Beta Cluster] use require_once instead of include for import.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/761964 (owner: 10Krinkle) [17:33:25] (03CR) 10jerkins-bot: [V: 04-1] [Beta Cluster] use require_once instead of include for import.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/761964 (owner: 10Krinkle) [17:34:49] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) firing: Processing latency of WCQS_Streaming_Updater in codfw (k8s) is above 5 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org [17:35:44] (03PS3) 10Ahmon Dancy: wmf-config: Use __DIR__ instead of "$IP/../wmf-config" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/761965 (https://phabricator.wikimedia.org/T45956) (owner: 10Krinkle) [17:35:49] (RdfStreamingUpdaterFlinkJobUnstable) firing: WCQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - 
https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org [17:35:54] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [17:35:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:58] 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Change physical label from copernicum.wikimedia.org to mirror1001.wikimedia.org - https://phabricator.wikimedia.org/T297906 (10wiki_willy) a:03Cmjohnson [17:37:05] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [17:37:06] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [17:37:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:37:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:37:43] (03CR) 10Ahmon Dancy: [C: 03+2] wmf-config: Use __DIR__ instead of "$IP/../wmf-config" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/761965 (https://phabricator.wikimedia.org/T45956) (owner: 10Krinkle) [17:38:03] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [17:38:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:27] (03Merged) 10jenkins-bot: wmf-config: Use __DIR__ instead of "$IP/../wmf-config" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/761965 (https://phabricator.wikimedia.org/T45956) (owner: 10Krinkle) [17:40:25] 10SRE, 10Infrastructure-Foundations: Setup new mirror server (mirror1001.wikimedia.org) - https://phabricator.wikimedia.org/T286898 (10Cmjohnson) [17:40:27] 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Change physical label from copernicum.wikimedia.org to mirror1001.wikimedia.org - https://phabricator.wikimedia.org/T297906 (10Cmjohnson) 05Open→03Resolved done [17:40:49] (RdfStreamingUpdaterFlinkJobUnstable) resolved: WCQS_Streaming_Updater in codfw (k8s) is unstable - 
https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org [17:40:52] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be1069.mgmt.eqiad.wmnet with reboot policy FORCED [17:40:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:02] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be1071.mgmt.eqiad.wmnet with reboot policy FORCED [17:41:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:06] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be1068.mgmt.eqiad.wmnet with reboot policy FORCED [17:41:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:22] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be1070.mgmt.eqiad.wmnet with reboot policy FORCED [17:41:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:35] 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Change physical label from copernicum.wikimedia.org to mirror1001.wikimedia.org - https://phabricator.wikimedia.org/T297906 (10jhathaway) thanks! [17:41:44] !log dancy@deploy1002 Synchronized wmf-config/CommonSettings.php: Config: [[gerrit:761965|wmf-config: Use __DIR__ instead of "$IP/../wmf-config" (T45956)]] (duration: 00m 50s) [17:41:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:48] T45956: Rename $wmf* to $wmg* in wmf-config - https://phabricator.wikimedia.org/T45956 [17:41:52] I'm done. 
[17:42:11] 10SRE, 10Infrastructure-Foundations: Setup new mirror server (mirror1001.wikimedia.org) - https://phabricator.wikimedia.org/T286898 (10jhathaway) 05Open→03Resolved [17:43:06] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [17:43:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:43:18] !log depooling thanos-fe1001 for envoy upgrade [17:43:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:44:21] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [17:44:22] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [17:44:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:44:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:04] rzl, hey, sorry I wasn't around, I have lost track of time [17:48:17] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [17:48:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:50:04] PROBLEM - SSH on wtp1026.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:50:11] zabe: no worries! 
got both patches merged, let me know if you see any problems [17:50:47] ok, thx :) [17:51:28] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:51:30] !log repool thanos-fe1001 [17:51:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:47] (03CR) 10Razzi: [C: 03+1] prometheus: Add more per-index metrics for elasticsearch [puppet] - 10https://gerrit.wikimedia.org/r/769123 (https://phabricator.wikimedia.org/T300295) (owner: 10Ebernhardson) [17:56:56] (03CR) 10Razzi: [C: 03+2] prometheus: Add more per-index metrics for elasticsearch [puppet] - 10https://gerrit.wikimedia.org/r/769123 (https://phabricator.wikimedia.org/T300295) (owner: 10Ebernhardson) [17:58:45] 10SRE, 10Data-Engineering, 10Data-Engineering-Kanban: Increase max.incremental.fetch.session.cache.slots on Kafka jumbo eqiad - https://phabricator.wikimedia.org/T303324 (10odimitrijevic) p:05Triage→03Medium [17:58:59] (03PS2) 10Ebernhardson: alertmanager: Configure task creation for search-platform [puppet] - 10https://gerrit.wikimedia.org/r/769131 (https://phabricator.wikimedia.org/T300295) [17:59:07] (03CR) 10Ebernhardson: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/769131 (https://phabricator.wikimedia.org/T300295) (owner: 10Ebernhardson) [17:59:20] RECOVERY - Check systemd state on datahubsearch1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:02:36] (03PS3) 10Ryan Kemper: alertmanager: Configure task creation for search-platform [puppet] - 10https://gerrit.wikimedia.org/r/769131 (https://phabricator.wikimedia.org/T300295) (owner: 10Ebernhardson) [18:05:04] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) resolved: Processing latency of WCQS_Streaming_Updater in codfw (k8s) is above 5 minutes - 
https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org [18:05:10] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/769737 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [18:06:01] (03CR) 10Razzi: [C: 03+2] alertmanager: Configure task creation for search-platform [puppet] - 10https://gerrit.wikimedia.org/r/769131 (https://phabricator.wikimedia.org/T300295) (owner: 10Ebernhardson) [18:06:34] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) firing: Processing latency of WCQS_Streaming_Updater in codfw (k8s) is above 5 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org [18:07:23] (03PS1) 10MSantos: mobileapps: Bump to 2022-03-10-175759-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/769744 [18:07:40] PROBLEM - Check systemd state on datahubsearch1002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-elasticsearch-exporter-9200.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:08:28] !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1017.eqiad.wmnet with OS bullseye [18:08:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:58] !log ayounsi@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1017.eqiad.wmnet with OS bullseye [18:09:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:31] (03CR) 10Muehlenhoff: Add Cumin alias for mediabackups (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/769731 (owner: 10Muehlenhoff) [18:09:51] (03CR) 10Ottomata: [C: 03+1] Enable tls proxy telemetry by default in eventgate [deployment-charts] - 
10https://gerrit.wikimedia.org/r/768038 (https://phabricator.wikimedia.org/T303042) (owner: 10JMeybohm) [18:10:14] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) firing: Processing latency of WCQS_Streaming_Updater in codfw (k8s) is above 5 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org [18:11:21] !log installing cyrus-sasl2 security updates [18:11:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:58] (03CR) 10MSantos: [C: 03+2] mobileapps: Bump to 2022-03-10-175759-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/769744 (owner: 10MSantos) [18:13:33] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [18:13:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:35] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. 
[18:13:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:12] (03PS1) 10JMeybohm: Stop loading wddx PHP extension with PHP 7.4 [puppet] - 10https://gerrit.wikimedia.org/r/769745 (https://phabricator.wikimedia.org/T295725) [18:14:31] 10SRE, 10serviceops, 10good first task: eventgate chart should use common_templates - https://phabricator.wikimedia.org/T303543 (10Ottomata) [18:14:37] (03PS6) 10Ottomata: charts:eventgate bump common_templates and standardize labels [deployment-charts] - 10https://gerrit.wikimedia.org/r/738578 (https://phabricator.wikimedia.org/T292390) (owner: 10Jelto) [18:14:48] (03CR) 10jerkins-bot: [V: 04-1] charts:eventgate bump common_templates and standardize labels [deployment-charts] - 10https://gerrit.wikimedia.org/r/738578 (https://phabricator.wikimedia.org/T292390) (owner: 10Jelto) [18:15:22] !log systemctl restart prometheus-wmf-elasticsearch-exporter-9200.service on elastic2042 for T300295 [18:15:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:26] T300295: Create an alert based on index age for reindexing Commons and Wikidata - https://phabricator.wikimedia.org/T300295 [18:16:52] (03Merged) 10jenkins-bot: mobileapps: Bump to 2022-03-10-175759-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/769744 (owner: 10MSantos) [18:17:18] 10SRE, 10serviceops, 10Patch-For-Review, 10good first task: Upgrade all deployment charts to use the latest version of common_templates - https://phabricator.wikimedia.org/T292390 (10Ottomata) BTW, I made a specific task to track the work to make eventgate chart use common_templates: {T303543} cc @BTullis [18:17:29] (03CR) 10JMeybohm: [C: 03+2] Enable tls proxy telemetry by default in eventgate [deployment-charts] - 10https://gerrit.wikimedia.org/r/768038 (https://phabricator.wikimedia.org/T303042) (owner: 10JMeybohm) [18:19:10] !log cumin 'C:elasticsearch' 'systemctl restart prometheus-wmf-elasticsearch-exporter-9200.service' 
[18:19:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:25] (PS1) Phuedx: Remove unused wgWMESearchRelevancePages config variable [mediawiki-config] - https://gerrit.wikimedia.org/r/769748 [18:19:39] !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1017.eqiad.wmnet with OS bullseye [18:19:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:04] !log ayounsi@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1017.eqiad.wmnet with OS bullseye [18:20:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3318 (T300775)', diff saved to https://phabricator.wikimedia.org/P22348 and previous config saved to /var/cache/conftool/dbconfig/20220310-182015-marostegui.json [18:20:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:19] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775 [18:21:05] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations, and 2 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (nskaggs) @Jclark-ctr I would want confirmation from infa foundations that all the necessary network connectivity is present. Fro... 
[18:21:09] (Merged) jenkins-bot: Enable tls proxy telemetry by default in eventgate [deployment-charts] - https://gerrit.wikimedia.org/r/768038 (https://phabricator.wikimedia.org/T303042) (owner: JMeybohm) [18:21:11] (CR) JMeybohm: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34196/console" [puppet] - https://gerrit.wikimedia.org/r/769745 (https://phabricator.wikimedia.org/T295725) (owner: JMeybohm) [18:21:26] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 68 probes of 668 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [18:24:41] (PS2) Ebernhardson: Create phab task when indices are too old [alerts] - https://gerrit.wikimedia.org/r/769127 (https://phabricator.wikimedia.org/T300295) [18:26:58] !log mbsantos@deploy1002 helmfile [staging] START helmfile.d/services/mobileapps: apply [18:27:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:16] (Abandoned) Ebernhardson: icinga: Move cirrus check into cirrus_cluster_checks [puppet] - https://gerrit.wikimedia.org/r/768818 (owner: Ebernhardson) [18:27:22] !log mbsantos@deploy1002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [18:27:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:28:14] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 58 probes of 668 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [18:28:24] !log mbsantos@deploy1002 helmfile [codfw] START helmfile.d/services/mobileapps: apply [18:28:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:29:03] (PS1) 
Herron: wip [puppet] - https://gerrit.wikimedia.org/r/769749 [18:29:15] !log mbsantos@deploy1002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [18:29:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:34] !log mbsantos@deploy1002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [18:30:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:32:12] (CR) Ryan Kemper: [C: +2] Prevent caching of auth redirect [puppet] - https://gerrit.wikimedia.org/r/768777 (https://phabricator.wikimedia.org/T301650) (owner: Ebernhardson) [18:33:31] !log mbsantos@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [18:33:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:35:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3318', diff saved to https://phabricator.wikimedia.org/P22349 and previous config saved to /var/cache/conftool/dbconfig/20220310-183520-marostegui.json [18:35:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:35:48] !log installing tiff security updates [18:36:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:24] (PS1) Zabe: wikitech: migrate wmf* to wmg* [mediawiki-config] - https://gerrit.wikimedia.org/r/769750 (https://phabricator.wikimedia.org/T45956) [18:37:37] (PS2) JMeybohm: Stop loading wddx PHP extension with PHP 7.4 [puppet] - https://gerrit.wikimedia.org/r/769745 (https://phabricator.wikimedia.org/T295725) [18:38:53] (CR) JMeybohm: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34199/console" [puppet] - https://gerrit.wikimedia.org/r/769745 (https://phabricator.wikimedia.org/T295725) (owner: JMeybohm) [18:39:36] !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-analytics: apply [18:39:38] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:52] (PS2) Zabe: wikitech: migrate wmf* to wmg* [mediawiki-config] - https://gerrit.wikimedia.org/r/769750 (https://phabricator.wikimedia.org/T45956) [18:40:23] !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics: apply [18:40:24] !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-analytics-external: apply [18:40:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:40:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:40:56] !log restarting thumbor to pick up tiff security updates [18:40:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:27] (PS3) Zabe: wikitech: migrate wmf* to wmg* [mediawiki-config] - https://gerrit.wikimedia.org/r/769750 (https://phabricator.wikimedia.org/T45956) [18:41:40] !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics-external: apply [18:41:41] !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-logging-external: apply [18:41:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:42:23] (CR) CDanis: [C: +1] C:varnish: Load public-clouds.json via netmapper (1 comment) [puppet] - https://gerrit.wikimedia.org/r/769464 (https://phabricator.wikimedia.org/T270391) (owner: Jbond) [18:43:06] (PS1) PipelineBot: blubberoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/769751 [18:43:11] (PS3) JMeybohm: Stop loading wddx PHP extension with PHP 7.4 [puppet] - https://gerrit.wikimedia.org/r/769745 (https://phabricator.wikimedia.org/T295725) [18:43:21] (CR) Razzi: [C: +2] Create phab task when indices are too old [alerts] - https://gerrit.wikimedia.org/r/769127 
(https://phabricator.wikimedia.org/T300295) (owner: Ebernhardson) [18:43:51] !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-logging-external: apply [18:43:52] !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-main: apply [18:43:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:44:38] (CR) JMeybohm: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34201/console" [puppet] - https://gerrit.wikimedia.org/r/769745 (https://phabricator.wikimedia.org/T295725) (owner: JMeybohm) [18:45:35] (CR) CDanis: [C: +1] C:varnish: use X-Public-Cloud to store the cloud provider (1 comment) [puppet] - https://gerrit.wikimedia.org/r/769511 (owner: Jbond) [18:45:53] SRE, Infrastructure-Foundations, CAS-SSO: idp.wikimedia.org asking twice for YubiKey - https://phabricator.wikimedia.org/T258029 (MoritzMuehlenhoff) Open→Resolved a:MoritzMuehlenhoff This was fixed by setting u2f_token_expiry_days to 3650 last year, marking as resolved. [18:46:03] !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-main: apply [18:46:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:20] (CR) Krinkle: [C: -1] wikitech: migrate wmf* to wmg* (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/769750 (https://phabricator.wikimedia.org/T45956) (owner: Zabe) [18:46:23] SRE, Infrastructure-Foundations, CAS-SSO: Investigate/enable new actuators for U2F token management - https://phabricator.wikimedia.org/T277837 (MoritzMuehlenhoff) Open→Declined U2F has been deprecated by Chrome, will be replaced by webauth. 
[18:47:06] (Merged) jenkins-bot: Create phab task when indices are too old [alerts] - https://gerrit.wikimedia.org/r/769127 (https://phabricator.wikimedia.org/T300295) (owner: Ebernhardson) [18:50:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3318', diff saved to https://phabricator.wikimedia.org/P22350 and previous config saved to /var/cache/conftool/dbconfig/20220310-185025-marostegui.json [18:50:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:15] (PS1) Muehlenhoff: Stop pinning the TGC cookie to the user agent and IP address [puppet] - https://gerrit.wikimedia.org/r/769753 (https://phabricator.wikimedia.org/T273858) [18:51:42] RECOVERY - SSH on wtp1026.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:53:06] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:53:42] (Abandoned) Dduvall: blubberoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/769751 (owner: PipelineBot) [18:54:39] (PS1) Volans: alertmanager: add support for dry-run mode [software/spicerack] - https://gerrit.wikimedia.org/r/769754 [18:54:41] (PS1) Volans: reposync: make tests run quicker [software/spicerack] - https://gerrit.wikimedia.org/r/769755 [18:55:07] (PS1) PipelineBot: blubberoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/769756 [18:55:28] (PS1) Krinkle: tests: Remove leftover wmfConfigDir global [mediawiki-config] - https://gerrit.wikimedia.org/r/769757 [18:55:50] (CR) Jbond: [C: +1] alertmanager: add support for dry-run mode [software/spicerack] - https://gerrit.wikimedia.org/r/769754 (owner: Volans) [18:56:03] (CR) Jbond: [C: +1] reposync: make tests run quicker [software/spicerack] - 
https://gerrit.wikimedia.org/r/769755 (owner: Volans) [18:56:04] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-analytics: apply [18:56:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:56:30] (PS2) Krinkle: [Beta Cluster] use require_once instead of include for import.php [mediawiki-config] - https://gerrit.wikimedia.org/r/761964 [18:57:07] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics: apply [18:57:08] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-analytics-external: apply [18:57:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:57:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:58:31] (CR) Herron: "something to get the ball rolling, here's a PCC https://puppet-compiler.wmflabs.org/pcc-worker1003/34202/" [puppet] - https://gerrit.wikimedia.org/r/769749 (https://phabricator.wikimedia.org/T300119) (owner: Herron) [18:58:31] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics-external: apply [18:58:33] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-logging-external: apply [18:58:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:58:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:59:39] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-logging-external: apply [18:59:40] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-main: apply [18:59:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:59:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:04] dancy and brennen: Time to snap out of that daydream and deploy MediaWiki train - Utc-7 Version. Get on with it. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220310T1900). [19:00:55] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-main: apply [19:00:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:01:27] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics: apply [19:01:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:01:33] o/ [19:02:30] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics: apply [19:02:32] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics-external: apply [19:02:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:02:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:04:29] (CR) Volans: [C: +2] alertmanager: add support for dry-run mode [software/spicerack] - https://gerrit.wikimedia.org/r/769754 (owner: Volans) [19:04:32] (CR) Volans: [C: +2] reposync: make tests run quicker [software/spicerack] - https://gerrit.wikimedia.org/r/769755 (owner: Volans) [19:04:37] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics-external: apply [19:04:38] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-logging-external: apply [19:04:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:04:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:04:53] (PS9) Jbond: C:varnish: Load public-clouds.json via netmapper [puppet] - https://gerrit.wikimedia.org/r/769464 (https://phabricator.wikimedia.org/T270391) [19:05:09] (CR) Jbond: C:varnish: Load public-clouds.json via netmapper (1 comment) [puppet] - https://gerrit.wikimedia.org/r/769464 (https://phabricator.wikimedia.org/T270391) (owner: Jbond) [19:05:31] !log marostegui@cumin1001 dbctl commit (dc=all): 
'Repooling after maintenance db1101:3318 (T300775)', diff saved to https://phabricator.wikimedia.org/P22351 and previous config saved to /var/cache/conftool/dbconfig/20220310-190530-marostegui.json [19:05:31] (CR) Krinkle: [C: +2] [Beta Cluster] use require_once instead of include for import.php [mediawiki-config] - https://gerrit.wikimedia.org/r/761964 (owner: Krinkle) [19:05:33] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1167.eqiad.wmnet with reason: Maintenance [19:05:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:05:34] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775 [19:05:34] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1167.eqiad.wmnet with reason: Maintenance [19:05:35] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [19:05:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:05:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:05:39] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [19:05:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:05:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:05:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1167 (T300775)', diff saved to https://phabricator.wikimedia.org/P22352 and previous config saved to /var/cache/conftool/dbconfig/20220310-190544-marostegui.json [19:05:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:00] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-logging-external: 
apply [19:06:01] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-main: apply [19:06:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:16] (Merged) jenkins-bot: [Beta Cluster] use require_once instead of include for import.php [mediawiki-config] - https://gerrit.wikimedia.org/r/761964 (owner: Krinkle) [19:06:58] (CR) Krinkle: [C: +1] Remove unused wgWMESearchRelevancePages config variable [mediawiki-config] - https://gerrit.wikimedia.org/r/769748 (owner: Phuedx) [19:07:27] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-main: apply [19:07:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:34] dancy: ^ no-op file to be pulled down with git. I won't git pull now in case you're already in that directory [19:07:45] monitoring beta cluster meanwhile [19:08:50] appears train is currently blocked, fwiw. 
[19:09:07] (PS1) Zabe: wikitech: migrate wmf* to wmg* [mediawiki-config] - https://gerrit.wikimedia.org/r/769760 (https://phabricator.wikimedia.org/T45956) [19:09:22] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [19:09:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:45] (Abandoned) Zabe: wikitech: migrate wmf* to wmg* [mediawiki-config] - https://gerrit.wikimedia.org/r/769760 (https://phabricator.wikimedia.org/T45956) (owner: Zabe) [19:10:12] (PS4) Zabe: wikitech: migrate wmf* to wmg* [mediawiki-config] - https://gerrit.wikimedia.org/r/769750 (https://phabricator.wikimedia.org/T45956) [19:10:31] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [19:10:32] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [19:10:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:46] (Merged) jenkins-bot: alertmanager: add support for dry-run mode [software/spicerack] - https://gerrit.wikimedia.org/r/769754 (owner: Volans) [19:10:48] (Merged) jenkins-bot: reposync: make tests run quicker [software/spicerack] - https://gerrit.wikimedia.org/r/769755 (owner: Volans) [19:11:26] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [19:11:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:11:30] (CR) Zabe: wikitech: migrate wmf* to wmg* (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/769750 (https://phabricator.wikimedia.org/T45956) (owner: Zabe) [19:11:35] SRE, SRE-OnFire (FY2021/2022-Q3), Data-Engineering, Event-Platform, and 2 others: Banner sampling leading to a relatively wide site outage (mostly esams) - https://phabricator.wikimedia.org/T303036 (JMeybohm) [19:13:52] 
PROBLEM - SSH on db2090.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:13:58] (CR) JMeybohm: [C: +1] "I'd really like this. So my opinion is to be considered tainted :-)" [puppet] - https://gerrit.wikimedia.org/r/769753 (https://phabricator.wikimedia.org/T273858) (owner: Muehlenhoff) [19:14:30] (PS1) PipelineBot: blubberoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/769761 [19:16:30] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [19:16:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:32] (PS1) Volans: CHANGELOG: add changelogs for release v2.3.2 [software/spicerack] - https://gerrit.wikimedia.org/r/769762 [19:18:51] (CR) Volans: [C: +2] CHANGELOG: add changelogs for release v2.3.2 [software/spicerack] - https://gerrit.wikimedia.org/r/769762 (owner: Volans) [19:21:03] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [19:21:04] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [19:21:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:21:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:21:42] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [19:21:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:21:50] (Abandoned) Dduvall: blubberoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/769756 (owner: PipelineBot) [19:21:56] (CR) Dduvall: [C: +2] blubberoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/769761 (owner: PipelineBot) [19:22:17] (CR) Jbond: [C: +1] Stop pinning the TGC cookie to the user agent and IP address [puppet] - 
https://gerrit.wikimedia.org/r/769753 (https://phabricator.wikimedia.org/T273858) (owner: Muehlenhoff) [19:25:21] (Merged) jenkins-bot: CHANGELOG: add changelogs for release v2.3.2 [software/spicerack] - https://gerrit.wikimedia.org/r/769762 (owner: Volans) [19:25:45] (Merged) jenkins-bot: blubberoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/769761 (owner: PipelineBot) [19:29:09] !log dduvall@deploy1002 helmfile [staging] START helmfile.d/services/blubberoid: apply [19:29:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:29:25] !log dduvall@deploy1002 helmfile [staging] DONE helmfile.d/services/blubberoid: apply [19:29:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:38] (PS1) Volans: Upstream release v2.3.2 [software/spicerack] (debian) - https://gerrit.wikimedia.org/r/769764 [19:31:40] !log dduvall@deploy1002 helmfile [codfw] START helmfile.d/services/blubberoid: apply [19:31:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:32:08] (CR) Volans: [C: +2] Upstream release v2.3.2 [software/spicerack] (debian) - https://gerrit.wikimedia.org/r/769764 (owner: Volans) [19:32:23] !log dduvall@deploy1002 helmfile [codfw] DONE helmfile.d/services/blubberoid: apply [19:32:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:32:39] !log dduvall@deploy1002 helmfile [eqiad] START helmfile.d/services/blubberoid: apply [19:32:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:33:19] !log dduvall@deploy1002 helmfile [eqiad] DONE helmfile.d/services/blubberoid: apply [19:33:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:14] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) resolved: Processing latency of WCQS_Streaming_Updater in codfw (k8s) is above 5 minutes - 
https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org [19:43:04] (PS1) Ahmon Dancy: group2 wikis to 1.38.0-wmf.25 refs T300201 [mediawiki-config] - https://gerrit.wikimedia.org/r/769768 [19:43:07] (CR) Ahmon Dancy: [C: +2] group2 wikis to 1.38.0-wmf.25 refs T300201 [mediawiki-config] - https://gerrit.wikimedia.org/r/769768 (owner: Ahmon Dancy) [19:43:43] (Merged) jenkins-bot: group2 wikis to 1.38.0-wmf.25 refs T300201 [mediawiki-config] - https://gerrit.wikimedia.org/r/769768 (owner: Ahmon Dancy) [19:44:12] !log uploaded spicerack_2.3.2 to apt.wikimedia.org buster-wikimedia,bullseye-wikimedia [19:44:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:44:58] !log dancy@deploy1002 rebuilt and synchronized wikiversions files: group2 wikis to 1.38.0-wmf.25 refs T300201 [19:45:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:45:02] T300201: 1.38.0-wmf.25 deployment blockers - https://phabricator.wikimedia.org/T300201 [19:46:30] !log volans@cumin2002 START - Cookbook sre.misc-clusters.sretest rolling restart_daemons on A:sretest [19:46:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:46:50] !log volans@cumin2002 END (PASS) - Cookbook sre.misc-clusters.sretest (exit_code=0) rolling restart_daemons on A:sretest [19:46:51] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [19:46:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:46:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:59] !log installed spicerack v2.3.2 on the cumin hosts [19:48:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:48:15] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [19:48:17] 
!log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [19:48:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:48:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:49:14] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [19:49:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:05:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T300775)', diff saved to https://phabricator.wikimedia.org/P22353 and previous config saved to /var/cache/conftool/dbconfig/20220310-200558-marostegui.json [20:06:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:06:04] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775 [20:11:20] jouncebot: nowandnext [20:11:21] For the next 0 hour(s) and 48 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220310T1900) [20:11:21] In 0 hour(s) and 48 minute(s): UTC late backport and config training (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220310T2100) [20:12:35] brennen, dancy: I'm going to roll out the envoy upgrade to the MW canaries, if it's an okay time? no impact expected, but happy to wait if you're in the middle of anything, just in case [20:12:50] yep. Train looks good. [20:13:06] thanks! 
going ahead [20:18:30] done, no problems [20:21:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P22354 and previous config saved to /var/cache/conftool/dbconfig/20220310-202103-marostegui.json [20:21:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:17] SRE, ops-eqiad, DC-Ops: eqiad: Master Tracking Ticket for eqiad expansion cage - https://phabricator.wikimedia.org/T296966 (Jclark-ctr) [20:36:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P22355 and previous config saved to /var/cache/conftool/dbconfig/20220310-203608-marostegui.json [20:36:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:40:03] SRE, ops-eqiad, DC-Ops, Traffic: cp1085 memory errors on DIMM A5 - https://phabricator.wikimedia.org/T303183 (wiki_willy) a:Cmjohnson [20:44:13] (PS1) Bartosz Dziewoński: Preserve classes on media wrapper links [extensions/VisualEditor] (wmf/1.38.0-wmf.25) - https://gerrit.wikimedia.org/r/769558 (https://phabricator.wikimedia.org/T292657) [20:44:25] (PS1) Bartosz Dziewoński: Fix highlighting of comments when reloading [extensions/DiscussionTools] (wmf/1.38.0-wmf.25) - https://gerrit.wikimedia.org/r/769559 (https://phabricator.wikimedia.org/T303261) [20:44:40] (PS1) Kosta Harlan: CommonSettings: Update comment about Image Suggestions API [mediawiki-config] - https://gerrit.wikimedia.org/r/769779 (https://phabricator.wikimedia.org/T294362) [20:45:06] (PS2) Kosta Harlan: CommonSettings: Update comment about Image Suggestions API [mediawiki-config] - https://gerrit.wikimedia.org/r/769779 (https://phabricator.wikimedia.org/T294362) [20:47:12] (CR) Arlolra: [C: +1] Preserve classes on media wrapper links [extensions/VisualEditor] (wmf/1.38.0-wmf.25) - 
https://gerrit.wikimedia.org/r/769558 (https://phabricator.wikimedia.org/T292657) (owner: Bartosz Dziewoński) [20:49:38] (CR) Cwhite: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/769749 (https://phabricator.wikimedia.org/T300119) (owner: Herron) [20:50:16] (PS1) JHathaway: mirrors: Raise ssl ciphersuite strength [puppet] - https://gerrit.wikimedia.org/r/769780 [20:50:29] (CR) Cwhite: [C: +2] caches: update grafana dashboard links [puppet] - https://gerrit.wikimedia.org/r/763830 (https://phabricator.wikimedia.org/T211982) (owner: Cwhite) [20:50:43] (PS2) JHathaway: mirrors: Raise ssl ciphersuite strength [puppet] - https://gerrit.wikimedia.org/r/769780 [20:50:58] (CR) Cwhite: [C: +2] zookeeper: update grafana dashboard links [puppet] - https://gerrit.wikimedia.org/r/763825 (https://phabricator.wikimedia.org/T211982) (owner: Cwhite) [20:51:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T300775)', diff saved to https://phabricator.wikimedia.org/P22356 and previous config saved to /var/cache/conftool/dbconfig/20220310-205114-marostegui.json [20:51:14] (CR) JHathaway: "kindly review" [puppet] - https://gerrit.wikimedia.org/r/769780 (owner: JHathaway) [20:51:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:51:19] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775 [20:51:46] (CR) Cwhite: [C: +2] search: update grafana dashboard links [puppet] - https://gerrit.wikimedia.org/r/763823 (https://phabricator.wikimedia.org/T211982) (owner: Cwhite) [20:52:15] (CR) SBassett: [C: +1] CommonSettings: Update comment about Image Suggestions API [mediawiki-config] - https://gerrit.wikimedia.org/r/769779 (https://phabricator.wikimedia.org/T294362) (owner: Kosta Harlan) [20:52:24] (PS2) Cwhite: pybal: update grafana dashboard links [puppet] - https://gerrit.wikimedia.org/r/763829 
(https://phabricator.wikimedia.org/T211982) [20:52:39] (Abandoned) Cwhite: pybal: update grafana dashboard links [puppet] - https://gerrit.wikimedia.org/r/763829 (https://phabricator.wikimedia.org/T211982) (owner: Cwhite) [20:53:00] (PS3) Cwhite: profile: update graphite mediawiki grafana dashboard links [puppet] - https://gerrit.wikimedia.org/r/763819 (https://phabricator.wikimedia.org/T211982) [20:53:30] (PS2) Cwhite: labstore: update grafana dashboard links [puppet] - https://gerrit.wikimedia.org/r/763828 (https://phabricator.wikimedia.org/T211982) [20:53:34] (Abandoned) Cwhite: labstore: update grafana dashboard links [puppet] - https://gerrit.wikimedia.org/r/763828 (https://phabricator.wikimedia.org/T211982) (owner: Cwhite) [20:55:51] (Abandoned) Cwhite: profile: update host monitoring grafana dashboard links [puppet] - https://gerrit.wikimedia.org/r/763820 (https://phabricator.wikimedia.org/T211982) (owner: Cwhite) [20:56:31] (CR) Cwhite: [C: +2] profile: update graphite mediawiki grafana dashboard links [puppet] - https://gerrit.wikimedia.org/r/763819 (https://phabricator.wikimedia.org/T211982) (owner: Cwhite) [20:56:32] SRE, ops-eqiad, Cloud-VPS, DC-Ops, and 2 others: cloudvirt1016.eqiad.wmnet and cloudvirt1017.eqiad.wmnet fail to PXE boot - https://phabricator.wikimedia.org/T303296 (Andrew) I'm putting 1021 back in service with the old OS for now. 1016 and 1017 are still fair game for experimentation. 
[20:57:42] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10DC-Ops, and 2 others: cloudvirt1016.eqiad.wmnet and cloudvirt1017.eqiad.wmnet fail to PXE boot - https://phabricator.wikimedia.org/T303296 (10Andrew) [20:59:32] RECOVERY - Check systemd state on datahubsearch1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:00:05] brennen: (Dis)respected human, time to deploy UTC late backport and config training (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220310T2100). Please do the needful. [21:00:05] zabe and MatmaRex: A patch you scheduled for UTC late backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:16] o/ [21:00:23] hi [21:01:03] howdy all! [21:01:05] o/ [21:01:10] taking a look at patches for backport now [21:03:01] 👋 deployer trainee here, I'll be pushing buttons shortly [21:04:37] (03PS3) 10RLazarus: Remove centralauth-oversight from the config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/766307 (https://phabricator.wikimedia.org/T302675) (owner: 10Zabe) [21:04:46] (03CR) 10RLazarus: [C: 03+2] "Backporting" [extensions/VisualEditor] (wmf/1.38.0-wmf.25) - 10https://gerrit.wikimedia.org/r/769558 (https://phabricator.wikimedia.org/T292657) (owner: 10Bartosz Dziewoński) [21:05:46] 10SRE, 10Observability-Metrics, 10Patch-For-Review, 10User-CDanis: Find links to grafana.wikimedia.org and change them to use the new URL format - https://phabricator.wikimedia.org/T211982 (10colewhite) [21:06:18] (03CR) 10RLazarus: [C: 03+2] Remove centralauth-oversight from the config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/766307 (https://phabricator.wikimedia.org/T302675) (owner: 10Zabe) [21:06:59] 10SRE, 10observability, 10Patch-For-Review, 10Performance-Team (Radar), 10User-CDanis: Upgrade grafana to 5.x - https://phabricator.wikimedia.org/T210416 
(10colewhite) [21:07:00] (03CR) 10RLazarus: [C: 03+2] "Backporting" [extensions/DiscussionTools] (wmf/1.38.0-wmf.25) - 10https://gerrit.wikimedia.org/r/769559 (https://phabricator.wikimedia.org/T303261) (owner: 10Bartosz Dziewoński) [21:07:02] (03Merged) 10jenkins-bot: Remove centralauth-oversight from the config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/766307 (https://phabricator.wikimedia.org/T302675) (owner: 10Zabe) [21:07:14] 10SRE, 10Observability-Metrics, 10Patch-For-Review, 10User-CDanis: Find links to grafana.wikimedia.org and change them to use the new URL format - https://phabricator.wikimedia.org/T211982 (10colewhite) 05Open→03Resolved a:03colewhite Puppet dashboard links have been updated. Feel free to reopen if... [21:07:56] PROBLEM - Check systemd state on datahubsearch1002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-elasticsearch-exporter-9200.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:09:56] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:09:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:10:39] zabe: pulled yours to mwdebug1001, please check [21:11:15] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:11:16] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:11:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:11:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:11:59] rzl, lgtm [21:12:09] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:12:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:12:22] 👍 [21:13:41] !log rzl@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:766307|Remove centralauth-oversight from the config (T302675)]] (duration: 00m 
49s) [21:13:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:13:45] T302675: Rename centralauth-oversight to centralauth-suppress following the rename of oversight to suppress - https://phabricator.wikimedia.org/T302675 [21:15:46] thanks for your help :) [21:17:14] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:17:15] (03PS1) 10JHathaway: mirrors: use @resolve for syncproxy2.wna.debian.org [puppet] - 10https://gerrit.wikimedia.org/r/769782 [21:17:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:17:20] RECOVERY - SSH on db2090.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:17:51] (03CR) 10JHathaway: "kindly review" [puppet] - 10https://gerrit.wikimedia.org/r/769782 (owner: 10JHathaway) [21:17:54] (03PS1) 10Ebernhardson: prometheus: restart elastic exporter on code change [puppet] - 10https://gerrit.wikimedia.org/r/769783 [21:18:31] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:18:32] (03CR) 10JHathaway: "tested on mirrors.wikimedia.org, the same iptables config is generated." 
[puppet] - 10https://gerrit.wikimedia.org/r/769782 (owner: 10JHathaway) [21:18:32] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:18:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:18:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:18:36] (03PS2) 10Ebernhardson: prometheus: restart elastic exporter on code change [puppet] - 10https://gerrit.wikimedia.org/r/769783 [21:18:57] (03PS1) 10Ssingh: certspotter: set send_mail to true to email the output of the service [puppet] - 10https://gerrit.wikimedia.org/r/769784 [21:19:12] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:19:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:19:21] (03Merged) 10jenkins-bot: Preserve classes on media wrapper links [extensions/VisualEditor] (wmf/1.38.0-wmf.25) - 10https://gerrit.wikimedia.org/r/769558 (https://phabricator.wikimedia.org/T292657) (owner: 10Bartosz Dziewoński) [21:19:24] (03Merged) 10jenkins-bot: Fix highlighting of comments when reloading [extensions/DiscussionTools] (wmf/1.38.0-wmf.25) - 10https://gerrit.wikimedia.org/r/769559 (https://phabricator.wikimedia.org/T303261) (owner: 10Bartosz Dziewoński) [21:20:54] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34203/console" [puppet] - 10https://gerrit.wikimedia.org/r/769784 (owner: 10Ssingh) [21:21:32] (03PS3) 10Bking: elastic: relax & restore perms during upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/769109 (https://phabricator.wikimedia.org/T301955) (owner: 10Ryan Kemper) [21:24:16] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:24:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:24:35] (03CR) 10Ssingh: [V: 03+1 C: 03+2] certspotter: set send_mail to true to email the 
output of the service [puppet] - 10https://gerrit.wikimedia.org/r/769784 (owner: 10Ssingh) [21:25:12] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:25:13] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:25:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:25:14] (JobUnavailable) firing: Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [21:25:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:25:55] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:25:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:26:22] MatmaRex: pulled yours to mwdebug1001, take a look? [21:26:47] looking [21:26:49] both of them? [21:26:55] yep [21:27:24] (03CR) 10Bking: [C: 03+2] elastic: relax & restore perms during upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/769109 (https://phabricator.wikimedia.org/T301955) (owner: 10Ryan Kemper) [21:30:08] VisualEditor patch looks good [21:30:10] rzl: would you be able to merge + sync https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/769779 ? it's just updating a comment, by request of security team. I could add it to the deployment calendar if you'd like. [21:30:35] (03Merged) 10jenkins-bot: elastic: relax & restore perms during upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/769109 (https://phabricator.wikimedia.org/T301955) (owner: 10Ryan Kemper) [21:30:53] (sorry to show up late to deployment window time!) [21:31:00] kostajh: sure thing [21:31:21] ty [21:31:42] and DiscussionTools patch looks good too [21:31:48] rzl: everything looks good [21:31:53] MatmaRex: rad, thanks! 
proceeding [21:33:35] !log rzl@deploy1002 Synchronized php-1.38.0-wmf.25/extensions/VisualEditor/modules/ve-mw: Backport: [[gerrit:769558|Preserve classes on media wrapper links (T292657 T303469)]] (duration: 00m 49s) [21:33:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:33:40] T303469: VE should preserve .mw-file-description class in Parsoid HTML - https://phabricator.wikimedia.org/T303469 [21:33:41] T292657: file "link=" syntax broken at some wikis - https://phabricator.wikimedia.org/T292657 [21:34:55] !log rzl@deploy1002 Synchronized php-1.38.0-wmf.25/extensions/DiscussionTools/modules/controller.js: Backport: [[gerrit:769559|Fix highlighting of comments when reloading (T303261)]] (duration: 00m 47s) [21:34:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:34:58] T303261: After reloading from the new comment (edit conflict) warning message, the tool highlights one comment too many - https://phabricator.wikimedia.org/T303261 [21:35:47] (03CR) 10RLazarus: [C: 03+2] CommonSettings: Update comment about Image Suggestions API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/769779 (https://phabricator.wikimedia.org/T294362) (owner: 10Kosta Harlan) [21:36:35] thanks rzl [21:36:43] (03Merged) 10jenkins-bot: CommonSettings: Update comment about Image Suggestions API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/769779 (https://phabricator.wikimedia.org/T294362) (owner: 10Kosta Harlan) [21:39:15] !log rzl@deploy1002 Synchronized wmf-config/CommonSettings.php: Config: [[gerrit:769779|CommonSettings: Update comment about Image Suggestions API (T294362)]] (duration: 00m 48s) [21:39:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:39:18] T294362: Image Suggestions POC Deprecation & Plan for Production - https://phabricator.wikimedia.org/T294362 [21:41:02] !log UTC late B&C training window done [21:41:04] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:41:25] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:41:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:42:35] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:42:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:42:45] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:42:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:42:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:43:42] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:43:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:46:31] (03PS1) 10Bking: elastic: add missing restart flag [cookbooks] - 10https://gerrit.wikimedia.org/r/769789 (https://phabricator.wikimedia.org/T301955) [21:48:35] (03PS2) 10Ryan Kemper: elastic: add missing restart flag [cookbooks] - 10https://gerrit.wikimedia.org/r/769789 (https://phabricator.wikimedia.org/T301955) (owner: 10Bking) [21:51:31] (03PS4) 10Razzi: elasticsearch: upgrade relforge to elasticsearch 6.8 [puppet] - 10https://gerrit.wikimedia.org/r/763479 (https://phabricator.wikimedia.org/T301955) (owner: 10Gehel) [21:52:26] (03CR) 10Ryan Kemper: [C: 03+2] elastic: add missing restart flag [cookbooks] - 10https://gerrit.wikimedia.org/r/769789 (https://phabricator.wikimedia.org/T301955) (owner: 10Bking) [21:52:33] (03CR) 10Razzi: [C: 03+2] elasticsearch: upgrade relforge to elasticsearch 6.8 [puppet] - 10https://gerrit.wikimedia.org/r/763479 (https://phabricator.wikimedia.org/T301955) (owner: 10Gehel) [21:55:12] (03Merged) 10jenkins-bot: elastic: add missing restart flag 
[cookbooks] - 10https://gerrit.wikimedia.org/r/769789 (https://phabricator.wikimedia.org/T301955) (owner: 10Bking) [22:00:04] TimStarling: Time to snap out of that daydream and deploy Special backport: global_edit_count patch deployment. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220310T2200). [22:00:21] * Reedy grins [22:00:32] thanks jouncebot [22:02:31] (03PS1) 10Tim Starling: Track global user edit counts in a DB table [extensions/CentralAuth] (wmf/1.38.0-wmf.25) - 10https://gerrit.wikimedia.org/r/769561 (https://phabricator.wikimedia.org/T130439) [22:02:51] (03CR) 10Tim Starling: [C: 03+2] Track global user edit counts in a DB table [extensions/CentralAuth] (wmf/1.38.0-wmf.25) - 10https://gerrit.wikimedia.org/r/769561 (https://phabricator.wikimedia.org/T130439) (owner: 10Tim Starling) [22:02:52] !log bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: relforge cluster restart - bking@cumin1001 - T301955 [22:02:52] !log bking@cumin1001 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: relforge cluster restart - bking@cumin1001 - T301955 [22:02:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:02:56] T301955: Upgrade relforge to elasticsearch 6.8.23 - https://phabricator.wikimedia.org/T301955 [22:02:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:04:06] !log bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: relforge cluster restart - bking@cumin1001 - T301955 [22:04:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:04:46] !log bking@cumin1001 END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) Operation.RESTART (1 nodes at a 
time) for ElasticSearch cluster relforge: relforge cluster restart - bking@cumin1001 - T301955 [22:04:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:04:59] (03Merged) 10jenkins-bot: Track global user edit counts in a DB table [extensions/CentralAuth] (wmf/1.38.0-wmf.25) - 10https://gerrit.wikimedia.org/r/769561 (https://phabricator.wikimedia.org/T130439) (owner: 10Tim Starling) [22:05:53] !log bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: relforge cluster restart - bking@cumin1001 - T301955 [22:05:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:06:23] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "The option is needed in the non-sni listener. We should remove the feature flag once we've upgraded the fleet to 1.18.x" [puppet] - 10https://gerrit.wikimedia.org/r/769749 (https://phabricator.wikimedia.org/T300119) (owner: 10Herron) [22:08:03] !log bking@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: relforge cluster restart - bking@cumin1001 - T301955 [22:08:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:08:07] T301955: Upgrade relforge to elasticsearch 6.8.23 - https://phabricator.wikimedia.org/T301955 [22:08:52] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [22:08:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:10:07] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [22:10:08] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [22:10:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:10:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:10:51] !log 
mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [22:10:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:13:57] PROBLEM - ElasticSearch health check for shards on 9200 on relforge1004 is CRITICAL: CRITICAL - elasticsearch inactive shards 276 threshold =0.15 breach: unassigned_shards: 270, number_of_pending_tasks: 10, active_shards: 17, number_of_in_flight_fetch: 0, relocating_shards: 0, initializing_shards: 6, delayed_unassigned_shards: 0, cluster_name: relforge-eqiad, active_primary_shards: 17, timed_out: False, task_max_waiting_in_queue_millis: 3 [22:13:57] ber_of_nodes: 2, number_of_data_nodes: 2, active_shards_percent_as_number: 5.802047781569966, status: red https://wikitech.wikimedia.org/wiki/Search%23Administration [22:16:45] RECOVERY - ElasticSearch health check for shards on 9200 on relforge1004 is OK: OK - elasticsearch status relforge-eqiad: number_of_data_nodes: 2, relocating_shards: 0, active_shards_percent_as_number: 100.0, active_primary_shards: 163, status: green, number_of_pending_tasks: 0, timed_out: False, task_max_waiting_in_queue_millis: 0, active_shards: 293, number_of_in_flight_fetch: 0, number_of_nodes: 2, initializing_shards: 0, cluster_name: [22:16:45] e-eqiad, delayed_unassigned_shards: 0, unassigned_shards: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:21:38] !log tstarling@deploy1002 Synchronized php-1.38.0-wmf.25/extensions/CentralAuth/includes/CentralAuthEditCounter.php: global_edit_count gerrit 769561 (duration: 00m 48s) [22:21:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:22:27] !log tstarling@deploy1002 Synchronized php-1.38.0-wmf.25/extensions/CentralAuth/includes/ServiceWiring.php: global_edit_count gerrit 769561 (duration: 00m 48s) [22:22:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:22:32] I made a list of files in dependency order and I'm doing sync-file on each of 
them using a for loop [22:23:16] !log tstarling@deploy1002 Synchronized php-1.38.0-wmf.25/extensions/CentralAuth/includes/CentralAuthServices.php: global_edit_count gerrit 769561 (duration: 00m 47s) [22:23:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:23:29] after the loop is done I'll run scap which will include the extension.json change that activates the new hook handler [22:24:05] !log tstarling@deploy1002 Synchronized php-1.38.0-wmf.25/extensions/CentralAuth/includes/Hooks/Handlers/UserEditCountUpdateHookHandler.php: global_edit_count gerrit 769561 (duration: 00m 47s) [22:24:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:24:35] PROBLEM - SSH on mw2258.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:24:54] !log tstarling@deploy1002 Synchronized php-1.38.0-wmf.25/extensions/CentralAuth/includes/User/CentralAuthUser.php: global_edit_count gerrit 769561 (duration: 00m 47s) [22:24:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:27:22] !log tstarling@deploy1002 Started scap: global_edit_count gerrit 769561 [22:27:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:36:37] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:37:06] (03CR) 10Bking: elastic: relax & restore perms during upgrade (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/769109 (https://phabricator.wikimedia.org/T301955) (owner: 10Ryan Kemper) [22:42:34] !log tstarling@deploy1002 Finished scap: global_edit_count gerrit 769561 (duration: 15m 12s) [22:42:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:46:06] (03PS1) 10RLazarus: miscweb: Update envoy to 1.18.3-1 in staging [deployment-charts] - 
10https://gerrit.wikimedia.org/r/769791 [22:59:06] (03CR) 10RLazarus: [C: 03+2] miscweb: Update envoy to 1.18.3-1 in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/769791 (owner: 10RLazarus) [23:02:48] (03Merged) 10jenkins-bot: miscweb: Update envoy to 1.18.3-1 in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/769791 (owner: 10RLazarus) [23:07:44] !log rzl@deploy1002 helmfile [staging] START helmfile.d/services/miscweb: apply [23:07:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:08:11] !log rzl@deploy1002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [23:08:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:11:58] 10SRE, 10serviceops, 10Release-Engineering-Team (Doing): Reduce latency of new Scap releases - https://phabricator.wikimedia.org/T292646 (10dancy) [23:14:53] 10SRE, 10serviceops, 10Release-Engineering-Team (Doing): Reduce latency of new Scap releases - https://phabricator.wikimedia.org/T292646 (10dancy) [23:37:35] (03PS1) 10RLazarus: miscweb: Update envoy to 1.18.3-1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/769793 [23:41:22] (03PS2) 10RLazarus: miscweb: Update envoy to 1.18.3-1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/769793 (https://phabricator.wikimedia.org/T300324) [23:41:38] (03CR) 10RLazarus: [C: 03+2] miscweb: Update envoy to 1.18.3-1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/769793 (https://phabricator.wikimedia.org/T300324) (owner: 10RLazarus) [23:45:42] (03Merged) 10jenkins-bot: miscweb: Update envoy to 1.18.3-1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/769793 (https://phabricator.wikimedia.org/T300324) (owner: 10RLazarus) [23:55:39] !log rzl@deploy1002 helmfile [codfw] START helmfile.d/services/miscweb: apply [23:55:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:58:34] !log rzl@deploy1002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply 
[23:58:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log