[00:00:26] RECOVERY - Check systemd state on maps2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:22:47] 10SRE, 10Traffic-Icebox, 10WMF-General-or-Unknown: Disable caching on the main page for anonymous users - https://phabricator.wikimedia.org/T119366 (10Seksen) 05Stalled→03Open p:05Medium→03High Trwiki admin here, admittedly with little-to-no technical knowledge regarding this, just dropping by to not... [00:32:02] PROBLEM - ElasticSearch unassigned shard check - 9200 on relforge1003 is CRITICAL: CRITICAL - queries_01022021[1](2022-12-08T19:59:50.738Z), queries_01022021[3](2022-12-08T19:59:50.738Z), queries_01022021[4](2022-12-08T19:59:50.738Z), queries_01022021[2](2022-12-08T19:59:50.738Z), queries_01022021[0](2022-12-08T19:59:50.738Z), ebernhardson_test[0](2022-12-08T19:59:50.734Z), queries_24012021[1](2022-12-08T19:59:50.732Z), queries_24012021[3 [00:32:02] 2-08T19:59:50.732Z), queries_24012021[4](2022-12-08T19:59:50.732Z), queries_24012021[2](2022-12-08T19:59:50.732Z), queries_24012021[0](2022-12-08T19:59:50.732Z), joined_queries-202201[1](2022-12-08T19:59:50.734Z), joined_queries-202201[2](2022-12-08T19:59:50.734Z), joined_queries-202201[0](2022-12-08T19:59:50.734Z), joined_queries-202212[1](2022-12-08T19:59:50.733Z), joined_queries-202212[2](2022-12-08T19:59:50.733Z), joined_queries-20221 [00:32:02] 2-12-08T19:59:50.733Z), .ltrstore[0](2022-12-08T20:09:47.858Z), .kibana_1[0](2022-12-08T20:09:47.858Z), joined_queries-202204[1](2022-12-08T19:59:50.737Z), joined_queries-202204[2](2022-12-08T19:59:50.737Z) https://wikitech.wikimedia.org/wiki/Search%23Administration [00:32:26] PROBLEM - ElasticSearch unassigned shard check - 9200 on relforge1004 is CRITICAL: CRITICAL - queries_02022021[2](2022-12-08T19:59:50.732Z), queries_02022021[1](2022-12-08T19:59:50.732Z), queries_02022021[4](2022-12-08T19:59:50.732Z), queries_02022021[3](2022-12-08T19:59:50.732Z), 
queries_02022021[0](2022-12-08T19:59:50.732Z), queries_01022021[2](2022-12-08T19:59:50.738Z), queries_01022021[1](2022-12-08T19:59:50.738Z), queries_01022021[3] [00:32:26] -08T19:59:50.738Z), queries_01022021[4](2022-12-08T19:59:50.738Z), queries_01022021[0](2022-12-08T19:59:50.738Z), queries_30012021[2](2022-12-08T19:59:50.737Z), queries_30012021[1](2022-12-08T19:59:50.737Z), queries_30012021[3](2022-12-08T19:59:50.737Z), queries_30012021[4](2022-12-08T19:59:50.737Z), queries_30012021[0](2022-12-08T19:59:50.737Z), mw_cirrus_metastore[2](2022-12-08T19:59:50.737Z), mw_cirrus_metastore[1](2022-12-08T19:59:50. [00:32:26] w_cirrus_metastore[3](2022-12-08T19:59:50.737Z), mw_cirrus_metastore[4](2022-12-08T19:59:50.737Z), mw_cirrus_metastore[0](2022-12-08T19:59:50.737Z), queries_23012021[2](2022-12-08T19:59:50.735Z), queries_23 https://wikitech.wikimedia.org/wiki/Search%23Administration [00:36:46] PROBLEM - ElasticSearch unassigned shard check - 9400 on relforge1004 is CRITICAL: CRITICAL - mw_cirrus_metastore[2](2022-12-08T19:59:50.731Z), mw_cirrus_metastore[4](2022-12-08T19:59:50.729Z), mw_cirrus_metastore[1](2022-12-08T19:59:50.731Z), mw_cirrus_metastore[3](2022-12-08T19:59:50.731Z), mw_cirrus_metastore[0](2022-12-08T19:59:50.731Z) https://wikitech.wikimedia.org/wiki/Search%23Administration [00:36:46] PROBLEM - ElasticSearch unassigned shard check - 9400 on relforge1003 is CRITICAL: CRITICAL - mw_cirrus_metastore[4](2022-12-08T19:59:50.729Z), mw_cirrus_metastore[3](2022-12-08T19:59:50.731Z), mw_cirrus_metastore[2](2022-12-08T19:59:50.731Z), mw_cirrus_metastore[1](2022-12-08T19:59:50.731Z), mw_cirrus_metastore[0](2022-12-08T19:59:50.731Z) https://wikitech.wikimedia.org/wiki/Search%23Administration [00:36:48] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on cloudcumin2001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [01:41:46] (JobUnavailable) firing: (9) Reduced availability for job 
nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:56:46] (JobUnavailable) firing: (11) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:06:46] (JobUnavailable) firing: (11) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:11:46] (JobUnavailable) firing: (11) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:21:46] (JobUnavailable) firing: (11) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:42:25] (03PS1) 10Tim Starling: Add myself to root-authorized-keys [labs/private] - 10https://gerrit.wikimedia.org/r/866829 [02:46:21] (03CR) 10Legoktm: [C: 03+1] "Matches the key already in LDAP, and Tim has prod root." 
[labs/private] - 10https://gerrit.wikimedia.org/r/866829 (owner: 10Tim Starling) [03:41:19] (03CR) 10Tim Starling: [C: 03+2] Add myself to root-authorized-keys [labs/private] - 10https://gerrit.wikimedia.org/r/866829 (owner: 10Tim Starling) [03:44:10] (03CR) 10Tim Starling: [V: 03+2 C: 03+2] Add myself to root-authorized-keys [labs/private] - 10https://gerrit.wikimedia.org/r/866829 (owner: 10Tim Starling) [04:15:56] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:33:52] 10SRE, 10Traffic-Icebox, 10WMF-General-or-Unknown: Disable caching on the main page for anonymous users - https://phabricator.wikimedia.org/T119366 (10Aklapper) p:05High→03Medium [04:36:48] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on cloudcumin2001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [04:45:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [04:50:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [05:11:34] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:14:10] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: db1186 power supplies not redundant - https://phabricator.wikimedia.org/T324858 (10Marostegui) Thank you, it looks good now indeed ` 
------------------------------------------------------------------------------- Record: 4 Date/Time: 12/10/2022 03:23:29 Source:... [06:14:52] (03PS1) 10Marostegui: Revert "Revert "db1206: Enable notifications"" [puppet] - 10https://gerrit.wikimedia.org/r/866695 [06:15:55] (03CR) 10Marostegui: [C: 03+2] Revert "Revert "db1206: Enable notifications"" [puppet] - 10https://gerrit.wikimedia.org/r/866695 (owner: 10Marostegui) [06:16:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1206 (re)pooling @ 1%: Testing new RAID controller', diff saved to https://phabricator.wikimedia.org/P42666 and previous config saved to /var/cache/conftool/dbconfig/20221212-061630-root.json [06:22:01] (JobUnavailable) firing: Reduced availability for job trafficserver-text in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:31:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1206 (re)pooling @ 5%: Testing new RAID controller', diff saved to https://phabricator.wikimedia.org/P42667 and previous config saved to /var/cache/conftool/dbconfig/20221212-063135-root.json [06:32:39] (03PS1) 10Marostegui: production-m2.sql.erb: Add new user [puppet] - 10https://gerrit.wikimedia.org/r/867001 (https://phabricator.wikimedia.org/T324142) [06:34:30] (03CR) 10Marostegui: [C: 03+2] production-m2.sql.erb: Add new user [puppet] - 10https://gerrit.wikimedia.org/r/867001 (https://phabricator.wikimedia.org/T324142) (owner: 10Marostegui) [06:46:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1206 (re)pooling @ 10%: Testing new RAID controller', diff saved to https://phabricator.wikimedia.org/P42668 and previous config saved to /var/cache/conftool/dbconfig/20221212-064640-root.json [06:58:09] (03PS1) 10KartikMistry: Enable Section Translation in Chuvash Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/867002 
(https://phabricator.wikimedia.org/T319176) [07:01:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1206 (re)pooling @ 25%: Testing new RAID controller', diff saved to https://phabricator.wikimedia.org/P42669 and previous config saved to /var/cache/conftool/dbconfig/20221212-070145-root.json [07:13:00] 10SRE, 10Data-Persistence, 10MediaWiki-extensions-SecurePoll, 10MW-1.40-notes (1.40.0-wmf.12; 2022-11-28), and 2 others: vote.wikimedia.org's Special:Securepoll/list/1402 takes considerably longer in codfw than in eqiad, leading to timeouts - https://phabricator.wikimedia.org/T324556 (10Ladsgroup) 05Open... [07:16:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1206 (re)pooling @ 50%: Testing new RAID controller', diff saved to https://phabricator.wikimedia.org/P42670 and previous config saved to /var/cache/conftool/dbconfig/20221212-071650-root.json [07:31:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1206 (re)pooling @ 75%: Testing new RAID controller', diff saved to https://phabricator.wikimedia.org/P42671 and previous config saved to /var/cache/conftool/dbconfig/20221212-073155-root.json [07:47:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1206 (re)pooling @ 100%: Testing new RAID controller', diff saved to https://phabricator.wikimedia.org/P42672 and previous config saved to /var/cache/conftool/dbconfig/20221212-074700-root.json [07:51:32] (03PS1) 10Ayounsi: cloud-in filter remove term dumps [homer/public] - 10https://gerrit.wikimedia.org/r/867110 [07:53:19] (03CR) 10Ayounsi: [C: 03+2] cloud-in filter remove term dumps [homer/public] - 10https://gerrit.wikimedia.org/r/867110 (owner: 10Ayounsi) [07:53:52] (03Merged) 10jenkins-bot: cloud-in filter remove term dumps [homer/public] - 10https://gerrit.wikimedia.org/r/867110 (owner: 10Ayounsi) [07:59:54] ACKNOWLEDGEMENT - ElasticSearch health check for shards on 9200 on relforge1003 is CRITICAL: CRITICAL - elasticsearch inactive shards 159 threshold =0.15 breach: cluster_name: 
relforge-eqiad, status: yellow, timed_out: False, number_of_nodes: 2, number_of_data_nodes: 2, active_primary_shards: 159, active_shards: 159, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 159, delayed_unassigned_shards: 0, number_of_pending_tasks [07:59:54] ber_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 50.0 Guillaume Lederrey https://phabricator.wikimedia.org/T324939 https://wikitech.wikimedia.org/wiki/Search%23Administration [07:59:54] ACKNOWLEDGEMENT - ElasticSearch health check for shards on 9400 on relforge1003 is CRITICAL: CRITICAL - elasticsearch inactive shards 5 threshold =0.15 breach: cluster_name: relforge-eqiad-small-alpha, status: yellow, timed_out: False, number_of_nodes: 2, number_of_data_nodes: 2, active_primary_shards: 5, active_shards: 5, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 5, delayed_unassigned_shards: 0, number_of_pending_t [07:59:54] number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 50.0 Guillaume Lederrey https://phabricator.wikimedia.org/T324939 https://wikitech.wikimedia.org/wiki/Search%23Administration [07:59:54] ACKNOWLEDGEMENT - ElasticSearch unassigned shard check - 9200 on relforge1003 is CRITICAL: CRITICAL - queries_01022021[1](2022-12-08T19:59:50.738Z), queries_01022021[3](2022-12-08T19:59:50.738Z), queries_01022021[4](2022-12-08T19:59:50.738Z), queries_01022021[2](2022-12-08T19:59:50.738Z), queries_01022021[0](2022-12-08T19:59:50.738Z), ebernhardson_test[0](2022-12-08T19:59:50.734Z), queries_24012021[1](2022-12-08T19:59:50.732Z), queries_24 [07:59:54] ](2022-12-08T19:59:50.732Z), queries_24012021[4](2022-12-08T19:59:50.732Z), queries_24012021[2](2022-12-08T19:59:50.732Z), queries_24012021[0](2022-12-08T19:59:50.732Z), joined_queries-202201[1](2022-12-08T19:59:50.734Z), joined_queries-202201[2](2022-12-08T19:59:50.734Z), joined_queries-202201[0](2022-12-08T19:59:50.734Z), 
joined_queries-202212[1](2022-12-08T19:59:50.733Z), joined_queries-202212[2](2022-12-08T19:59:50.733Z), joined_queri [07:59:55] 2[0](2022-12-08T19:59:50.733Z), .ltrstore[0](2022-12-08T20:09:47.858Z), .kibana_1[0](2022-12-08T20:09:47.858Z), joined_queries-202204[1](2022-12-08T19:59:50.737Z), joined_queries-202204[2](2022-12-08T19:59:50.737Z) Guillaume Lederrey https://phabricator.wikimedia.org/T324939 https://wikitech.wikimedia.org/wiki/Search%23Administration [07:59:55] ACKNOWLEDGEMENT - ElasticSearch unassigned shard check - 9400 on relforge1003 is CRITICAL: CRITICAL - mw_cirrus_metastore[4](2022-12-08T19:59:50.729Z), mw_cirrus_metastore[3](2022-12-08T19:59:50.731Z), mw_cirrus_metastore[2](2022-12-08T19:59:50.731Z), mw_cirrus_metastore[1](2022-12-08T19:59:50.731Z), mw_cirrus_metastore[0](2022-12-08T19:59:50.731Z) Guillaume Lederrey https://phabricator.wikimedia.org/T324939 https://wikitech.wikimedia.org [07:59:55] arch%23Administration [07:59:56] ACKNOWLEDGEMENT - ElasticSearch health check for shards on 9200 on relforge1004 is CRITICAL: CRITICAL - elasticsearch inactive shards 159 threshold =0.15 breach: cluster_name: relforge-eqiad, status: yellow, timed_out: False, number_of_nodes: 2, number_of_data_nodes: 2, active_primary_shards: 159, active_shards: 159, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 159, delayed_unassigned_shards: 0, number_of_pending_tasks [07:59:56] ber_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 50.0 Guillaume Lederrey https://phabricator.wikimedia.org/T324939 https://wikitech.wikimedia.org/wiki/Search%23Administration [07:59:57] ACKNOWLEDGEMENT - ElasticSearch health check for shards on 9400 on relforge1004 is CRITICAL: CRITICAL - elasticsearch inactive shards 5 threshold =0.15 breach: cluster_name: relforge-eqiad-small-alpha, status: yellow, timed_out: False, number_of_nodes: 2, number_of_data_nodes: 2, active_primary_shards: 5, active_shards: 5, relocating_shards: 0, 
initializing_shards: 0, unassigned_shards: 5, delayed_unassigned_shards: 0, number_of_pending_t [07:59:57] number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 50.0 Guillaume Lederrey https://phabricator.wikimedia.org/T324939 https://wikitech.wikimedia.org/wiki/Search%23Administration [07:59:58] ACKNOWLEDGEMENT - ElasticSearch unassigned shard check - 9200 on relforge1004 is CRITICAL: CRITICAL - queries_02022021[2](2022-12-08T19:59:50.732Z), queries_02022021[1](2022-12-08T19:59:50.732Z), queries_02022021[4](2022-12-08T19:59:50.732Z), queries_02022021[3](2022-12-08T19:59:50.732Z), queries_02022021[0](2022-12-08T19:59:50.732Z), queries_01022021[2](2022-12-08T19:59:50.738Z), queries_01022021[1](2022-12-08T19:59:50.738Z), queries_010 [07:59:58] (2022-12-08T19:59:50.738Z), queries_01022021[4](2022-12-08T19:59:50.738Z), queries_01022021[0](2022-12-08T19:59:50.738Z), queries_30012021[2](2022-12-08T19:59:50.737Z), queries_30012021[1](2022-12-08T19:59:50.737Z), queries_30012021[3](2022-12-08T19:59:50.737Z), queries_30012021[4](2022-12-08T19:59:50.737Z), queries_30012021[0](2022-12-08T19:59:50.737Z), mw_cirrus_metastore[2](2022-12-08T19:59:50.737Z), mw_cirrus_metastore[1](2022-12-08T1 [07:59:59] 737Z), mw_cirrus_metastore[3](2022-12-08T19:59:50.737Z), mw_cirrus_metastore[4](2022-12-08T19:59:50.737Z), mw_cirrus_metastore[0](2022-12-08T19:59:50.737Z), queries_23012021[2](2022-12-08T19:59:50.735Z), queries_23 Guillaume Lederrey https://phabricator.wikimedia.org/T324939 https://wikitech.wikimedia.org/wiki/Search%23Administration [07:59:59] ACKNOWLEDGEMENT - ElasticSearch unassigned shard check - 9400 on relforge1004 is CRITICAL: CRITICAL - mw_cirrus_metastore[2](2022-12-08T19:59:50.731Z), mw_cirrus_metastore[4](2022-12-08T19:59:50.729Z), mw_cirrus_metastore[1](2022-12-08T19:59:50.731Z), mw_cirrus_metastore[3](2022-12-08T19:59:50.731Z), mw_cirrus_metastore[0](2022-12-08T19:59:50.731Z) Guillaume Lederrey 
https://phabricator.wikimedia.org/T324939 https://wikitech.wikimedia.org [08:00:00] arch%23Administration [08:00:04] Amir1 and Urbanecm: Dear deployers, time to do the UTC morning backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221212T0800). [08:00:05] matthiasmullie: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [08:00:10] o/ [08:00:36] can you self-service? [08:00:47] yeah, sure! [08:02:19] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by mlitn@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/845518 (https://phabricator.wikimedia.org/T321069) (owner: 10Matthias Mullie) [08:03:11] (03Merged) 10jenkins-bot: Add mediawiki.searchpreview schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/845518 (https://phabricator.wikimedia.org/T321069) (owner: 10Matthias Mullie) [08:05:15] There's another unexpected patch here: "The following are unexpected commits pulled from origin for /srv/mediawiki-staging: commit d0ad5766b1d9998c0d30412c91439e26678fed34" [08:05:37] Ah, looks like that's a beta-only thing - https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/866518 [08:05:46] Continuing [08:06:14] !log mlitn@deploy1002 Started scap: Backport for [[gerrit:845518|Add mediawiki.searchpreview schema (T321069)]] [08:06:18] T321069: [L] SearchPreview instrumentation - Create a Schema - https://phabricator.wikimedia.org/T321069 [08:08:26] !log remove bast5001 from management routers ACLs (replaced by bast5002) [08:08:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:08:35] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [software/bitu] - 10https://gerrit.wikimedia.org/r/850465 (https://phabricator.wikimedia.org/T319410) (owner: 10Slyngshede) [08:08:58] (03CR) 10Muehlenhoff: 
[C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/866649 (owner: 10JHathaway) [08:14:18] (03CR) 10Slyngshede: [V: 03+2] Bitu IDM, initial checkin [software/bitu] - 10https://gerrit.wikimedia.org/r/850465 (https://phabricator.wikimedia.org/T319410) (owner: 10Slyngshede) [08:14:21] (03CR) 10Slyngshede: [V: 03+2 C: 03+2] Bitu IDM, initial checkin [software/bitu] - 10https://gerrit.wikimedia.org/r/850465 (https://phabricator.wikimedia.org/T319410) (owner: 10Slyngshede) [08:14:52] (03PS5) 10Slyngshede: Add RQ support to Django [software/bitu] - 10https://gerrit.wikimedia.org/r/853290 [08:15:11] (03CR) 10Slyngshede: [V: 03+2] Add RQ support to Django [software/bitu] - 10https://gerrit.wikimedia.org/r/853290 (owner: 10Slyngshede) [08:15:14] (03CR) 10Slyngshede: [V: 03+2 C: 03+2] Add RQ support to Django [software/bitu] - 10https://gerrit.wikimedia.org/r/853290 (owner: 10Slyngshede) [08:15:24] (03PS5) 10Slyngshede: WIP: Signup and LDAP flow. [software/bitu] - 10https://gerrit.wikimedia.org/r/860021 [08:15:47] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/866620 (https://phabricator.wikimedia.org/T324753) (owner: 10JHathaway) [08:16:09] !log mlitn@deploy1002 mlitn and mlitn: Backport for [[gerrit:845518|Add mediawiki.searchpreview schema (T321069)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet [08:16:13] T321069: [L] SearchPreview instrumentation - Create a Schema - https://phabricator.wikimedia.org/T321069 [08:16:51] 10SRE, 10LDAP-Access-Requests: Grant Access to wmde for Muhammad Jaziraly - https://phabricator.wikimedia.org/T324477 (10MoritzMuehlenhoff) 05Resolved→03Open We still need a tracking entry in modules/admin/data/data.yaml [08:24:36] !log mlitn@deploy1002 Finished scap: Backport for [[gerrit:845518|Add mediawiki.searchpreview schema (T321069)]] (duration: 18m 21s) [08:24:40] T321069: [L] SearchPreview instrumentation - Create a Schema 
- https://phabricator.wikimedia.org/T321069 [08:25:21] !log UTC morning backports done [08:25:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:36:48] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on cloudcumin2001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [08:55:47] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host thanos-be1002.eqiad.wmnet with OS bullseye [08:55:51] 10SRE-swift-storage: Run Thanos backend on Bullseye - https://phabricator.wikimedia.org/T288937 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host thanos-be1002.eqiad.wmnet with OS bullseye [09:01:30] (03PS1) 10Elukey: superset: deploy python3-gevent of bullseye [puppet] - 10https://gerrit.wikimedia.org/r/867114 [09:02:53] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38686/console" [puppet] - 10https://gerrit.wikimedia.org/r/867114 (owner: 10Elukey) [09:03:58] !log jgiannelos@deploy1002 Started deploy [kartotherian/deploy@6af0d2d] (codfw): Increase codfw mirrored traffic to 25% [09:06:13] !log jgiannelos@deploy1002 Finished deploy [kartotherian/deploy@6af0d2d] (codfw): Increase codfw mirrored traffic to 25% (duration: 02m 15s) [09:30:26] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: hw troubleshooting: PSU failure for restbase1018.eqiad.wmnet - https://phabricator.wikimedia.org/T324572 (10Clement_Goubert) 05Resolved→03Open Hi, Checking up on this on the server, it would seem it started failing again immediately: ` 32 | Dec-09-2022 | 13... 
[09:30:27] PROBLEM - Host parse1002 is DOWN: PING CRITICAL - Packet loss = 100% [09:31:19] ^ checking [09:31:45] RECOVERY - Host parse1002 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms [09:40:57] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift-account-stats_chartmuseum:prod.service,swift-account-stats_mlserve:prod.service,swift-account-stats_search:platform.service,swift-account-stats_tegola:prod.service,swift-account-stats_thanos:prod.service,swift_dispersion_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:42:01] (03PS1) 10Effie Mouzeli: mediawiki::mcrouter_wancache: Add mc2051 to memcached cluster [puppet] - 10https://gerrit.wikimedia.org/r/867119 (https://phabricator.wikimedia.org/T293012) [09:43:24] (03CR) 10Effie Mouzeli: [C: 03+1] deployment servers: Enable profile::auto_restarts::service for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/866607 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [09:49:10] 10SRE, 10MW-on-K8s, 10observability, 10serviceops: New mediawiki.httpd.accesslog topic on kafka-logging + logstash and dashboard - https://phabricator.wikimedia.org/T324439 (10Clement_Goubert) I've dug into it a bit, and we have 3 brokers per datacenter for kafka-logging, so for balance's sake I'll create... 
[09:50:42] (03CR) 10Muehlenhoff: [C: 03+2] Make ganeti5007 a Ganeti node [puppet] - 10https://gerrit.wikimedia.org/r/866572 (https://phabricator.wikimedia.org/T324610) (owner: 10Muehlenhoff) [09:52:20] (03CR) 10Effie Mouzeli: [C: 03+2] mediawiki::mcrouter_wancache: Add mc2051 to memcached cluster [puppet] - 10https://gerrit.wikimedia.org/r/867119 (https://phabricator.wikimedia.org/T293012) (owner: 10Effie Mouzeli) [09:52:26] 10SRE, 10MW-on-K8s, 10observability, 10serviceops: New mediawiki.httpd.accesslog topic on kafka-logging + logstash and dashboard - https://phabricator.wikimedia.org/T324439 (10Clement_Goubert) ` cgoubert@kafka-logging1001:~$ kafka topics --create --topic mediawiki.http.accesslog --partitions 6 --replicatio... [09:53:47] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:58:58] !log cgoubert@cumin1001 conftool action : set/pooled=inactive; selector: name=parse1002.eqiad.wmnet [10:00:34] !log jgiannelos@deploy1002 Started deploy [kartotherian/deploy@bdc19a3] (codfw): Increase codfw mirrored traffic to 50% [10:02:16] !log jgiannelos@deploy1002 Finished deploy [kartotherian/deploy@bdc19a3] (codfw): Increase codfw mirrored traffic to 50% (duration: 01m 42s) [10:04:12] 10ops-eqiad, 10DC-Ops, 10serviceops: hw troubleshooting: CPU alerts for parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T324949 (10Clement_Goubert) [10:04:37] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Prod-Kubernetes, and 2 others: Write a cookbook to set a k8s cluster in maintenance mode - https://phabricator.wikimedia.org/T277677 (10elukey) Question about the scope of the cookbook - do we want to aggregate functionalities already present in other coo... 
[10:06:30] PROBLEM - mediawiki-installation DSH group on parse1002 is CRITICAL: Host parse1002 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [10:10:05] 10ops-eqiad, 10DC-Ops, 10serviceops: hw troubleshooting: CPU alerts for parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T324949 (10Clement_Goubert) Host rebooted spontaneously: ` 09:30 <+icinga-wm> PROBLEM - Host parse1002 is DOWN: PING CRITICAL - Packet loss = 100% 09:31 ^ checking 09:31... [10:10:25] 10ops-eqiad, 10DC-Ops, 10serviceops: hw troubleshooting: CPU alerts for parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T324949 (10Clement_Goubert) ` racadm>>racadm getsel Record: 1 Date/Time: 01/24/2022 17:43:06 Source: system Severity: Ok Description: Log cleared. -----------------... [10:10:44] 10ops-eqiad, 10DC-Ops, 10serviceops: hw troubleshooting: CPU alerts for parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T324949 (10Clement_Goubert) ` cgoubert@parse1002:~$ sudo ipmi-sel ID | Date | Time | Name | Type | Event 1 | Jan-24-2022 | 17:43:0... [10:11:48] 10ops-eqiad, 10DC-Ops, 10serviceops: hw troubleshooting: CPU alerts for parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T324949 (10Clement_Goubert) Host depooled: ` cgoubert@cumin1001:~$ sudo confctl select 'name=parse1002.eqiad.wmnet' set/pooled=inactive The selector you chose has selected the fo... 
[10:16:02] (03CR) 10Elukey: Upgrade knative to 1.7.2 (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/861395 (https://phabricator.wikimedia.org/T323793) (owner: 10Elukey) [10:16:04] (03PS15) 10Elukey: Upgrade knative to 1.7.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/861395 (https://phabricator.wikimedia.org/T323793) [10:17:25] !log depooled parse1002.eqiad.wmnet for hw failure - T324949 [10:17:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:30] T324949: hw troubleshooting: CPU alerts for parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T324949 [10:18:57] !log Update modify-mfa tools, https://gerrit.wikimedia.org/r/c/operations/puppet/+/861385 [10:18:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:19:09] (03CR) 10Slyngshede: [C: 03+2] ldap:management rewrite modify-mfa to use Bitu. [puppet] - 10https://gerrit.wikimedia.org/r/861385 (owner: 10Slyngshede) [10:22:01] (JobUnavailable) firing: Reduced availability for job trafficserver-text in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:24:05] !log update add-ldap-group tool, https://gerrit.wikimedia.org/r/c/operations/puppet/+/860568 [10:24:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:12] (03CR) 10Slyngshede: [C: 03+2] C:ldap::client::utils Rewrite add-ldap-group [puppet] - 10https://gerrit.wikimedia.org/r/860568 (owner: 10Slyngshede) [10:26:56] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti5007.eqsin.wmnet [10:29:14] (03CR) 10Elukey: "Hi Alex! I am reviewing tasks for the 1.23 upgrade, are you planning to work on this or do you prefer me to take over?" 
[puppet] - 10https://gerrit.wikimedia.org/r/791597 (https://phabricator.wikimedia.org/T270191) (owner: 10Alexandros Kosiaris) [10:32:38] (03PS1) 10Ladsgroup: Bump portals to HEAD [mediawiki-config] - 10https://gerrit.wikimedia.org/r/867122 [10:35:09] jouncebot: nowandnext [10:35:09] No deployments scheduled for the next 3 hour(s) and 24 minute(s) [10:35:09] In 3 hour(s) and 24 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221212T1400) [10:35:14] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti5007.eqsin.wmnet [10:36:26] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti5007.eqsin.wmnet [10:36:45] (03CR) 10Ladsgroup: [C: 03+2] Bump portals to HEAD [mediawiki-config] - 10https://gerrit.wikimedia.org/r/867122 (owner: 10Ladsgroup) [10:37:15] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/867122 (owner: 10Ladsgroup) [10:37:33] (03Merged) 10jenkins-bot: Bump portals to HEAD [mediawiki-config] - 10https://gerrit.wikimedia.org/r/867122 (owner: 10Ladsgroup) [10:37:48] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:867122|Bump portals to HEAD]] [10:39:09] (03CR) 10Clément Goubert: [V: 03+1] "This change is ready for review." 
[puppet] - 10https://gerrit.wikimedia.org/r/867123 (https://phabricator.wikimedia.org/T324949) (owner: 10Clément Goubert) [10:39:22] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be1002.eqiad.wmnet with reason: host reimage [10:39:37] !log ladsgroup@deploy1002 ladsgroup and ladsgroup: Backport for [[gerrit:867122|Bump portals to HEAD]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet [10:39:53] (03CR) 10Btullis: [V: 03+2 C: 03+2] Update the spark images to remove upstream support for the webhook [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/864770 (https://phabricator.wikimedia.org/T318926) (owner: 10Btullis) [10:40:16] (03CR) 10Btullis: [V: 03+1 C: 03+2] Configure the kube_env file for the spark-operator namespace [puppet] - 10https://gerrit.wikimedia.org/r/854505 (https://phabricator.wikimedia.org/T321686) (owner: 10Btullis) [10:40:22] (03PS2) 10Btullis: Configure the kube_env file for the spark-operator namespace [puppet] - 10https://gerrit.wikimedia.org/r/854505 (https://phabricator.wikimedia.org/T321686) [10:40:38] (03CR) 10Giuseppe Lavagetto: [C: 03+1] scap/conftool: Switch parsoid canary to parse1003 [puppet] - 10https://gerrit.wikimedia.org/r/867123 (https://phabricator.wikimedia.org/T324949) (owner: 10Clément Goubert) [10:42:26] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be1002.eqiad.wmnet with reason: host reimage [10:43:33] (03CR) 10Btullis: [C: 03+2] superset: deploy python3-gevent of bullseye [puppet] - 10https://gerrit.wikimedia.org/r/867114 (owner: 10Elukey) [10:47:36] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:867122|Bump portals to HEAD]] (duration: 09m 48s) [10:49:27] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host ganeti5007.eqsin.wmnet [10:50:18] (03CR) 10Clément Goubert: [V: 03+1 C: 03+2] 
scap/conftool: Switch parsoid canary to parse1003 [puppet] - 10https://gerrit.wikimedia.org/r/867123 (https://phabricator.wikimedia.org/T324949) (owner: 10Clément Goubert) [10:50:29] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: hw troubleshooting: PSU failure for restbase1018.eqiad.wmnet - https://phabricator.wikimedia.org/T324572 (10Jclark-ctr) Will take another look at server when I get in today. [10:50:34] PROBLEM - Check systemd state on ganeti5007 is CRITICAL: CRITICAL - degraded: The following units failed: networking.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:51:14] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti5007.eqsin.wmnet [10:51:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST nodes) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:52:55] !log Switched parse1002 to parse1003 in parsoid-canary - T324949 [10:52:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:59] T324949: hw troubleshooting: CPU alerts for parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T324949 [10:56:59] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST nodes) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:57:57] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be1002.eqiad.wmnet with OS bullseye [10:58:02] 10SRE-swift-storage: Run Thanos backend on Bullseye - https://phabricator.wikimedia.org/T288937 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host thanos-be1002.eqiad.wmnet with OS bullseye completed: - thanos-be1002 (**PASS**) - 
Downtimed on Icinga/Alertmanager... [10:58:31] (03CR) 10Hashar: [C: 03+2] Document how to test a JavaScript Gerrit plugin [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/860885 (https://phabricator.wikimedia.org/T214068) (owner: 10Hashar) [10:58:36] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:58:38] RECOVERY - Check systemd state on ganeti5007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:58:50] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti5007.eqsin.wmnet [10:58:58] (03PS1) 10Slyngshede: C:ldap::client::utils add missing directory. [puppet] - 10https://gerrit.wikimedia.org/r/867125 [10:59:06] (03Merged) 10jenkins-bot: Document how to test a JavaScript Gerrit plugin [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/860885 (https://phabricator.wikimedia.org/T214068) (owner: 10Hashar) [11:00:41] (03PS3) 10Volans: cumin: add an audit report for insetup servers [puppet] - 10https://gerrit.wikimedia.org/r/864729 [11:01:16] (03PS4) 10Volans: cumin: add an audit report for insetup servers [puppet] - 10https://gerrit.wikimedia.org/r/864729 [11:01:18] (03PS2) 10Volans: profile::cumin: use bool2str to simplify code [puppet] - 10https://gerrit.wikimedia.org/r/865728 [11:02:06] (03CR) 10Volans: "addressed comments" [puppet] - 10https://gerrit.wikimedia.org/r/864729 (owner: 10Volans) [11:02:24] (03CR) 10Volans: "reply inline" [puppet] - 10https://gerrit.wikimedia.org/r/865728 (owner: 10Volans) [11:05:56] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [11:07:29] (03PS1) 10Ayounsi: Avoid 
Tata->Free path [homer/public] - 10https://gerrit.wikimedia.org/r/867128 (https://phabricator.wikimedia.org/T324955) [11:07:34] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [11:07:50] 10SRE, 10Traffic: Varnish wrongly reports x-cache/x-cache-status in some scenarios - https://phabricator.wikimedia.org/T324956 (10Vgutierrez) [11:09:14] 10SRE, 10Traffic: Varnish wrongly reports x-cache/x-cache-status in some scenarios - https://phabricator.wikimedia.org/T324956 (10Vgutierrez) p:05Triage→03Medium [11:09:23] (03CR) 10Volans: [C: 03+2] cumin: add an audit report for insetup servers [puppet] - 10https://gerrit.wikimedia.org/r/864729 (owner: 10Volans) [11:09:29] (03PS2) 10Ayounsi: Avoid Tata->Free path [homer/public] - 10https://gerrit.wikimedia.org/r/867128 (https://phabricator.wikimedia.org/T324955) [11:09:35] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/864729 (owner: 10Volans) [11:10:28] (03PS1) 10Slyngshede: C:ldap::client::utils rollback updated LDAP tools [puppet] - 10https://gerrit.wikimedia.org/r/867134 [11:10:46] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti5007.eqsin.wmnet to cluster eqsin and group 1 [11:11:24] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:11:42] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/867134 (owner: 10Slyngshede) [11:12:20] PROBLEM - swift eqiad object availability low on alert1001 is CRITICAL: cluster=thanos instance=thanos-fe1001 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad [11:12:30] (03CR) 10Slyngshede: [C: 03+2] C:ldap::client::utils rollback updated LDAP tools [puppet] - 
10https://gerrit.wikimedia.org/r/867134 (owner: 10Slyngshede) [11:12:58] PROBLEM - swift eqiad container availability low on alert1001 is CRITICAL: cluster=thanos instance=thanos-fe1001 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad [11:13:32] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti5007.eqsin.wmnet to cluster eqsin and group 1 [11:13:40] Emperor: ^^ should be worried about that? [11:14:05] vgutierrez: Huh, not seen that alert before [11:14:10] RECOVERY - swift eqiad object availability low on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad [11:14:43] oh, if that's just container dispersion, then yes it's expected, I've been reimaging thanos-be nodes [11:14:48] RECOVERY - swift eqiad container availability low on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad [11:15:04] (03CR) 10Volans: [C: 03+2] profile::cumin: use bool2str to simplify code [puppet] - 10https://gerrit.wikimedia.org/r/865728 (owner: 10Volans) [11:16:15] (03CR) 10JMeybohm: [C: 03+2] kubeadm: Declare /etc/kubernetes directory resource directly [puppet] - 10https://gerrit.wikimedia.org/r/865619 (owner: 10JMeybohm) [11:16:34] (03CR) 10JMeybohm: [C: 03+2] pki: Allow to override the default expiry per intermediate [puppet] - 10https://gerrit.wikimedia.org/r/865075 (owner: 10JMeybohm) [11:16:52] Emperor: ack [11:19:19] (03CR) 10Muehlenhoff: [C: 03+2] deployment servers: Enable profile::auto_restarts::service for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/866607 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [11:20:24] (03PS3) 10Ayounsi: drmrs: offload traffic from Tata [homer/public] - 10https://gerrit.wikimedia.org/r/867128 (https://phabricator.wikimedia.org/T324955) [11:30:28] (03PS1) 10Slyngshede: C:ldap::management remove python3-bitu-ldap. [puppet] - 10https://gerrit.wikimedia.org/r/867137 [11:32:03] (03PS1) 10Muehlenhoff: Remove LDAP access for vyuen [puppet] - 10https://gerrit.wikimedia.org/r/867138 [11:32:53] (03CR) 10Clément Goubert: [V: 03+1] "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/867136 (https://phabricator.wikimedia.org/T324439) (owner: 10Clément Goubert) [11:34:00] (03CR) 10David Caro: [C: 03+1] "LGTM, would be interesting to investigate why or where to get it from if it's needed" [puppet] - 10https://gerrit.wikimedia.org/r/867137 (owner: 10Slyngshede) [11:34:38] 10SRE, 10MW-on-K8s, 10observability, 10serviceops, 10Patch-For-Review: New mediawiki.httpd.accesslog topic on kafka-logging + logstash and dashboard - https://phabricator.wikimedia.org/T324439 (10Clement_Goubert) >>! 
In T324439#8455654, @colewhite wrote: > At the beginning, we should configure logstash t... [11:35:22] (03CR) 10Muehlenhoff: [C: 03+2] Remove LDAP access for vyuen [puppet] - 10https://gerrit.wikimedia.org/r/867138 (owner: 10Muehlenhoff) [11:38:57] (03PS1) 10David Caro: Revert "cumin: add an audit report for insetup servers" [puppet] - 10https://gerrit.wikimedia.org/r/866699 [11:40:08] !log cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 14 days, 0:00:00 on parse1002.eqiad.wmnet with reason: Bad CPU [11:40:22] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on parse1002.eqiad.wmnet with reason: Bad CPU [11:40:27] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: hw troubleshooting: CPU alerts for parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T324949 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=cafc663b-25d8-4e28-8aea-f704dec7742e) set by cgoubert@cumin1001 for 14 days, 0:00:00 on 1 host... [11:40:33] 10ops-codfw: Port with no description on access switch - https://phabricator.wikimedia.org/T324752 (10ayounsi) 05Open→03Resolved a:03ayounsi I deleted it from https://librenms.wikimedia.org/ports/deleted=yes let me know if it happens again. 
[11:40:52] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: hw troubleshooting: CPU alerts for parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T324949 (10Clement_Goubert) All yours DC-Ops :) [11:40:56] 10SRE, 10ops-eqsin, 10Infrastructure-Foundations: ganeti500[567] implementation tracking - https://phabricator.wikimedia.org/T324610 (10MoritzMuehlenhoff) [11:42:26] (03PS1) 10Effie Mouzeli: mediawiki::mcrouter_wancache: Add mc2050 to memcached cluster [puppet] - 10https://gerrit.wikimedia.org/r/867140 (https://phabricator.wikimedia.org/T293012) [11:43:42] !log failover Ganeti master in eqsin to ganeti5004 (5003 will be decommissioned) T322048 [11:43:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:46] T322048: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 [11:45:43] (03CR) 10Slyngshede: [C: 03+2] C:ldap::management remove python3-bitu-ldap. [puppet] - 10https://gerrit.wikimedia.org/r/867137 (owner: 10Slyngshede) [11:46:16] PROBLEM - ganeti-wconfd running on ganeti5003 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [11:49:04] ^ expected [11:49:20] !log drain ganeti5003 for eventual decom T322048 [11:49:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:23] T322048: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 [11:50:58] (03Abandoned) 10Muehlenhoff: cas: Update to 6.6.0 [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/830236 (https://phabricator.wikimedia.org/T311235) (owner: 10Muehlenhoff) [11:52:49] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Prod-Kubernetes, and 2 others: Write a cookbook to set a k8s cluster in maintenance mode - https://phabricator.wikimedia.org/T277677 (10JMeybohm) >>! In T277677#8459708, @elukey wrote: > Question about the scope of the cookbook - do we want to aggregate f... 
[11:54:47] 10SRE-swift-storage: Run Thanos backend on Bullseye - https://phabricator.wikimedia.org/T288937 (10MatthewVernon) [11:56:28] (03PS1) 10Lucas Werkmeister (WMDE): query_service: support downloads in query builder [puppet] - 10https://gerrit.wikimedia.org/r/867142 (https://phabricator.wikimedia.org/T323451) [11:59:53] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host thanos-be1003.eqiad.wmnet with OS bullseye [11:59:57] 10SRE-swift-storage: Run Thanos backend on Bullseye - https://phabricator.wikimedia.org/T288937 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host thanos-be1003.eqiad.wmnet with OS bullseye [12:04:04] !log installing twisted security updates [12:04:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:46] RECOVERY - BGP status on cr2-drmrs is OK: BGP OK - up: 91, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:13:28] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be1003.eqiad.wmnet with reason: host reimage [12:13:57] 10SRE, 10MW-on-K8s, 10SRE Observability, 10serviceops, 10Patch-For-Review: Make logging work for mediawiki in k8s - https://phabricator.wikimedia.org/T288851 (10Clement_Goubert) >>! In T288851#7742391, @Krinkle wrote: >>>! In T288164#7742387, @Krinkle wrote: >> For the record, the logs from k8s-mwdebug p... 
[12:15:02] (03CR) 10Jbond: "minor nit" [puppet] - 10https://gerrit.wikimedia.org/r/866625 (owner: 10Andrew Bogott) [12:15:39] 10SRE, 10SRE-Access-Requests, 10Data-Engineering-Planning, 10WMF-Communications: LDAP access for Sondes to access Matomo - https://phabricator.wikimedia.org/T324696 (10LSobanski) [12:16:03] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be1003.eqiad.wmnet with reason: host reimage [12:19:01] !log installing jqueryui security updates [12:19:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:54] (03PS1) 10Muehlenhoff: blackbox smoke tests: Switch to ganeti5007 for rack 603 [puppet] - 10https://gerrit.wikimedia.org/r/867154 (https://phabricator.wikimedia.org/T322048) [12:23:56] (03PS5) 10JMeybohm: pki: Add intermediates for wikikube and wikikube staging (2/2) [puppet] - 10https://gerrit.wikimedia.org/r/865591 [12:23:58] (03PS9) 10JMeybohm: k8s: Add support for PKI with k8s >= 1.23 [puppet] - 10https://gerrit.wikimedia.org/r/865592 (https://phabricator.wikimedia.org/T307943) [12:24:00] (03PS4) 10JMeybohm: k8s: Remove authz_mode hiera key [puppet] - 10https://gerrit.wikimedia.org/r/866444 (https://phabricator.wikimedia.org/T307943) [12:24:02] (03PS1) 10JMeybohm: pki: Add intermediates for wikikube and wikikube staging (1/2) [puppet] - 10https://gerrit.wikimedia.org/r/867155 [12:26:19] (03CR) 10Jbond: [C: 03+1] pki: Add intermediates for wikikube and wikikube staging (1/2) [puppet] - 10https://gerrit.wikimedia.org/r/867155 (owner: 10JMeybohm) [12:26:58] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/865591 (owner: 10JMeybohm) [12:30:18] (03CR) 10Alexandros Kosiaris: "Last comment had the wrong link, the correct one is: https://gerrit.wikimedia.org/r/c/operations/puppet/+/858294" [puppet] - 10https://gerrit.wikimedia.org/r/854985 (owner: 10Alexandros Kosiaris) [12:30:55] (03CR) 10Jbond: "minor nit, otherwise looks good" [puppet] - 
10https://gerrit.wikimedia.org/r/866644 (owner: 10Andrew Bogott) [12:31:16] (03PS4) 10Aqu: HDFS FSImage is backed up to HDFS on monday [puppet] - 10https://gerrit.wikimedia.org/r/866650 (https://phabricator.wikimedia.org/T324850) [12:32:14] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be1003.eqiad.wmnet with OS bullseye [12:32:19] 10SRE-swift-storage: Run Thanos backend on Bullseye - https://phabricator.wikimedia.org/T288937 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host thanos-be1003.eqiad.wmnet with OS bullseye completed: - thanos-be1003 (**PASS**) - Downtimed on Icinga/Alertmanager... [12:32:49] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/865174 (https://phabricator.wikimedia.org/T127717) (owner: 10Andrew Bogott) [12:33:20] 10SRE-swift-storage: Run Thanos backend on Bullseye - https://phabricator.wikimedia.org/T288937 (10MatthewVernon) [12:34:29] (03PS1) 10AikoChou: ml-services: update revertrisk docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/867156 (https://phabricator.wikimedia.org/T323023) [12:36:48] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on cloudcumin2001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [12:48:30] !log installing Django security updates [12:48:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:44] !log installing jackson-databind security updates [12:49:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:06] (03CR) 10Muehlenhoff: [C: 03+2] blackbox smoke tests: Switch to ganeti5007 for rack 603 [puppet] - 10https://gerrit.wikimedia.org/r/867154 (https://phabricator.wikimedia.org/T322048) (owner: 10Muehlenhoff) [13:01:39] (03CR) 10Aqu: HDFS FSImage is backed up to HDFS on monday (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/866650 (https://phabricator.wikimedia.org/T324850) (owner: 10Aqu) [13:03:30] 10SRE, 10ops-eqsin, 10Infrastructure-Foundations: ganeti500[567] implementation tracking - https://phabricator.wikimedia.org/T324610 (10MoritzMuehlenhoff) 05Open→03Resolved The three new servers have been added to the eqsin cluster. [13:16:56] (03CR) 10Effie Mouzeli: [C: 03+2] mediawiki::mcrouter_wancache: Add mc2050 to memcached cluster [puppet] - 10https://gerrit.wikimedia.org/r/867140 (https://phabricator.wikimedia.org/T293012) (owner: 10Effie Mouzeli) [13:17:47] (03PS2) 10Effie Mouzeli: mediawiki::mcrouter_wancache: Add mc2050 to memcached cluster [puppet] - 10https://gerrit.wikimedia.org/r/867140 (https://phabricator.wikimedia.org/T293012) [13:21:30] (03CR) 10Alexandros Kosiaris: [C: 03+1] Update cxserver to 2022-12-06-121330-production (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/865063 (https://phabricator.wikimedia.org/T321781) (owner: 10KartikMistry) [13:22:21] (03PS2) 10DCausse: Update ltr plugin to 7.10.2-wmf1 [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/865178 (https://phabricator.wikimedia.org/T324247) (owner: 10Ebernhardson) [13:26:48] 10SRE, 10Infrastructure-Foundations: Migrate SSH bastions to Bullseye - https://phabricator.wikimedia.org/T324974 (10MoritzMuehlenhoff) [13:27:08] 10SRE, 10Infrastructure-Foundations: Migrate SSH bastions to Bullseye - https://phabricator.wikimedia.org/T324974 (10MoritzMuehlenhoff) p:05Triage→03Medium a:03MoritzMuehlenhoff [13:28:06] (03PS1) 10Muehlenhoff: buster updates [puppet] - 10https://gerrit.wikimedia.org/r/867161 [13:31:27] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host thanos-be1004.eqiad.wmnet with OS bullseye [13:31:31] 10SRE-swift-storage: Run Thanos backend on Bullseye - https://phabricator.wikimedia.org/T288937 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host 
thanos-be1004.eqiad.wmnet with OS bullseye [13:31:49] (03CR) 10DCausse: [C: 03+1] "lgtm," [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/865178 (https://phabricator.wikimedia.org/T324247) (owner: 10Ebernhardson) [13:33:01] (03PS9) 10Hnowlan: maps: remove tilerator and cassandra [puppet] - 10https://gerrit.wikimedia.org/r/760619 (https://phabricator.wikimedia.org/T298246) [13:33:46] (03CR) 10Muehlenhoff: [C: 03+2] buster updates [puppet] - 10https://gerrit.wikimedia.org/r/867161 (owner: 10Muehlenhoff) [13:34:54] (03CR) 10Raymond Ndibe: "merging this soon it that is ok by everyone" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/857588 (https://phabricator.wikimedia.org/T293645) (owner: 10Raymond Ndibe) [13:35:09] (03CR) 10Giuseppe Lavagetto: [C: 03+1] conftool: add kubernetes nodes as thumbor nodes [puppet] - 10https://gerrit.wikimedia.org/r/866445 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan) [13:35:24] (03CR) 10Giuseppe Lavagetto: trafficserver: move test2wiki to kubernetes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/862845 (https://phabricator.wikimedia.org/T290536) (owner: 10Giuseppe Lavagetto) [13:36:52] (03CR) 10Giuseppe Lavagetto: trafficserver: move test2wiki to kubernetes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/862845 (https://phabricator.wikimedia.org/T290536) (owner: 10Giuseppe Lavagetto) [13:37:00] (03PS2) 10Giuseppe Lavagetto: trafficserver: move test2wiki to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/862845 (https://phabricator.wikimedia.org/T290536) [13:37:28] (03CR) 10Giuseppe Lavagetto: "I would like to schedule the deployment of this patch on tuesday december 13th." 
[puppet] - 10https://gerrit.wikimedia.org/r/862845 (https://phabricator.wikimedia.org/T290536) (owner: 10Giuseppe Lavagetto) [13:42:09] (03PS12) 10Arturo Borrero Gonzalez: cloudlb: introduce role skeleton [puppet] - 10https://gerrit.wikimedia.org/r/861902 (https://phabricator.wikimedia.org/T297596) [13:44:21] (03PS1) 10Jbond: systemd::timer: only validate intervals if the timer is present [puppet] - 10https://gerrit.wikimedia.org/r/867165 [13:45:03] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be1004.eqiad.wmnet with reason: host reimage [13:47:05] (03CR) 10Vgutierrez: trafficserver: move test2wiki to kubernetes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/862845 (https://phabricator.wikimedia.org/T290536) (owner: 10Giuseppe Lavagetto) [13:47:54] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be1004.eqiad.wmnet with reason: host reimage [13:48:06] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on ganeti5003.eqsin.wmnet with reason: Remove for eventual decom [13:48:22] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on ganeti5003.eqsin.wmnet with reason: Remove for eventual decom [13:56:33] (03PS3) 10Tsevener: Add event stream config for ios.talk_page_interaction [mediawiki-config] - 10https://gerrit.wikimedia.org/r/866489 (https://phabricator.wikimedia.org/T324340) [13:56:43] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 105 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:58:37] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, and awight: (Dis)respected human, time to deploy UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221212T1400). Please do the needful. [14:00:04] MichaelG_WMDE and toni_: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:13] o/ [14:00:25] hey [14:00:31] I can deploy! [14:00:34] hi [14:00:38] hi :) [14:00:46] are there any other patches waiting? [14:00:56] if so, I would maybe have them go first [14:01:08] I want to check one last detail that I just noticed [14:01:12] there’s one other change [14:01:16] ok, then let’s go with toni_ first [14:01:40] sounds good [14:02:28] `scap backport` complains that the patch depends on a change that’s not in a production branch [14:02:49] well, wait [14:03:01] (03PS1) 10Volans: cumin: use ssh config restrict [puppet] - 10https://gerrit.wikimedia.org/r/867168 [14:03:03] (03PS1) 10Volans: base::cloud_production: introduce new profile [puppet] - 10https://gerrit.wikimedia.org/r/867169 (https://phabricator.wikimedia.org/T319401) [14:03:03] I’m not sure if “production branch” applies to schemas/event/secondary.git [14:03:05] (03PS1) 10Volans: cumin::cloud_target: add a new profile [puppet] - 10https://gerrit.wikimedia.org/r/867170 (https://phabricator.wikimedia.org/T319401) [14:03:18] hmm, yeah it's this one https://gerrit.wikimedia.org/r/c/schemas/event/secondary/+/857759 - it was merged last week [14:03:27] I don’t see any wmf branches at https://gerrit.wikimedia.org/g/schemas/event/secondary [14:03:40] so that probably doesn’t use the train? 
[14:03:49] (03CR) 10Hashar: [C: 03+2] Replace ESLint built-in jsdoc by the plugin version [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/860976 (owner: 10Hashar) [14:04:21] (03Merged) 10jenkins-bot: Replace ESLint built-in jsdoc by the plugin version [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/860976 (owner: 10Hashar) [14:04:51] I’ll tell scap to go ahead then [14:05:02] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/866489 (https://phabricator.wikimedia.org/T324340) (owner: 10Tsevener) [14:05:07] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] httpd-fcgi: allow logging ECS to a local rsyslog [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/864547 (https://phabricator.wikimedia.org/T265876) (owner: 10Giuseppe Lavagetto) [14:05:40] (03Merged) 10jenkins-bot: Add event stream config for ios.talk_page_interaction [mediawiki-config] - 10https://gerrit.wikimedia.org/r/866489 (https://phabricator.wikimedia.org/T324340) (owner: 10Tsevener) [14:05:55] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:866489|Add event stream config for ios.talk_page_interaction (T324340)]] [14:05:59] T324340: Add event stream config for iOS Talk Pages - https://phabricator.wikimedia.org/T324340 [14:07:32] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde and tsev: Backport for [[gerrit:866489|Add event stream config for ios.talk_page_interaction (T324340)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet [14:07:37] RECOVERY - ElasticSearch health check for shards on 9200 on relforge1004 is OK: OK - elasticsearch status relforge-eqiad: cluster_name: relforge-eqiad, status: yellow, timed_out: False, number_of_nodes: 2, number_of_data_nodes: 2, active_primary_shards: 159, active_shards: 315, 
relocating_shards: 0, initializing_shards: 2, unassigned_shards: 1, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_ma [14:07:37] g_in_queue_millis: 0, active_shards_percent_as_number: 99.05660377358491 https://wikitech.wikimedia.org/wiki/Search%23Administration [14:07:46] toni_: should be on mwdebug now, can you test it? [14:08:17] RECOVERY - ElasticSearch health check for shards on 9200 on relforge1003 is OK: OK - elasticsearch status relforge-eqiad: cluster_name: relforge-eqiad, status: yellow, timed_out: False, number_of_nodes: 2, number_of_data_nodes: 2, active_primary_shards: 159, active_shards: 315, relocating_shards: 0, initializing_shards: 2, unassigned_shards: 1, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_ma [14:08:17] g_in_queue_millis: 0, active_shards_percent_as_number: 99.05660377358491 https://wikitech.wikimedia.org/wiki/Search%23Administration [14:08:18] done - looks good to me [14:08:24] I was able to test I mean [14:08:27] ok, thanks! [14:08:37] thank you! [14:08:44] syncing [14:09:38] (03CR) 10Volans: [C: 03+1] "LGTM, thanks!" 
[puppet] - https://gerrit.wikimedia.org/r/867165 (owner: Jbond)
[14:09:58] (PS1) Giuseppe Lavagetto: httpd-fcgi: fix quoting [docker-images/production-images] - https://gerrit.wikimedia.org/r/867172
[14:10:28] (CR) Giuseppe Lavagetto: [V: +2 C: +2] httpd-fcgi: fix quoting [docker-images/production-images] - https://gerrit.wikimedia.org/r/867172 (owner: Giuseppe Lavagetto)
[14:12:35] !log jgiannelos@deploy1002 Started deploy [kartotherian/deploy@a2ebe75] (codfw): Increase codfw mirrored traffic to 75%
[14:12:39] (PS1) Effie Mouzeli: mediawiki::mcrouter_wancache: Add mc2049 to memcached cluster [puppet] - https://gerrit.wikimedia.org/r/867173 (https://phabricator.wikimedia.org/T293012)
[14:13:10] ok, I checked what I wanted, happy to move forward with my patch
[14:13:27] ok
[14:13:30] still syncing
[14:13:47] SRE, ops-eqiad, DC-Ops, serviceops: hw troubleshooting: PSU failure for restbase1018.eqiad.wmnet - https://phabricator.wikimedia.org/T324572 (Jclark-ctr) Open→Resolved @Clement_Goubert Swapped power supply out of recently decom Server looks to have resolved issue
[14:13:50] take your time
[14:14:16] !log jgiannelos@deploy1002 Finished deploy [kartotherian/deploy@a2ebe75] (codfw): Increase codfw mirrored traffic to 75% (duration: 01m 41s)
[14:14:37] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:866489|Add event stream config for ios.talk_page_interaction (T324340)]] (duration: 08m 42s)
[14:14:41] T324340: Add event stream config for iOS Talk Pages - https://phabricator.wikimedia.org/T324340
[14:14:43] there we go
[14:15:44] (PS5) Lucas Werkmeister (WMDE): Wikidata: don't show Vector search thumbnails [mediawiki-config] - https://gerrit.wikimedia.org/r/848421 (https://phabricator.wikimedia.org/T316093) (owner: Michael Große)
[14:16:42] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 11686
[14:16:42] (CR) TrainBranchBot: [C: +2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/848421 (https://phabricator.wikimedia.org/T316093) (owner: Michael Große)
[14:17:29] (Merged) jenkins-bot: Wikidata: don't show Vector search thumbnails [mediawiki-config] - https://gerrit.wikimedia.org/r/848421 (https://phabricator.wikimedia.org/T316093) (owner: Michael Große)
[14:17:44] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:848421|Wikidata: don't show Vector search thumbnails (T316093)]]
[14:17:47] T316093: Make new Vector search use wbsearchentities on Wikidata - https://phabricator.wikimedia.org/T316093
[14:17:59] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 11686
[14:18:31] !log rebalance Ganeti cluster in eqsin after adding ganeti5005-5007 T324610
[14:18:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:18:34] T324610: ganeti500[567] implementation tracking - https://phabricator.wikimedia.org/T324610
[14:18:50] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 62887
[14:19:12] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 62887
[14:19:22] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde and migr: Backport for [[gerrit:848421|Wikidata: don't show Vector search thumbnails (T316093)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet
[14:19:32] MichaelG_WMDE: ^
[14:19:42] * MichaelG_WMDE looks
[14:19:42] (PS2) Effie Mouzeli: mediawiki::mcrouter_wancache: Add mc2049 to memcached cluster [puppet] - https://gerrit.wikimedia.org/r/867173 (https://phabricator.wikimedia.org/T293012)
[14:20:41] looks good to me (though it seemed to require a Ctrl+F5)
[14:20:52] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 21804
[14:20:56] no thumbnails anymore on test.wikidata.org on the debug server \o/
[14:21:01] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 21804
[14:21:13] I think this cannot be tested on www.wikidata.org, right?
[14:21:20] I'm fine with moving this forward
[14:21:29] yeah, I think so
[14:21:32] I also tested on test wikidata
[14:21:35] ok, going ahead
[14:22:00] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 20804
[14:22:01] (JobUnavailable) firing: Reduced availability for job trafficserver-text in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:22:47] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 20804
[14:22:51] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 38757
[14:23:26] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 38757
[14:23:32] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 209275
[14:23:43] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 209275
[14:23:57] (CR) Volans: "Thanks for the python3 migration" [puppet] - https://gerrit.wikimedia.org/r/856601 (https://phabricator.wikimedia.org/T237807) (owner: Hashar)
[14:24:01] (CR) Volans: [C: +1] gerrit: script to report on git gc durations [puppet] - https://gerrit.wikimedia.org/r/856601 (https://phabricator.wikimedia.org/T237807) (owner: Hashar)
[14:24:05] SRE, ops-eqiad, DC-Ops, serviceops: hw troubleshooting: CPU alerts for parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T324949 (Jclark-ctr) a:Cmjohnson→Jclark-ctr Opened Dell support ticket Confirmed: Service Request 158148016 was successfully submitted
[14:27:31] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:848421|Wikidata: don't show Vector search thumbnails (T316093)]] (duration: 09m 47s)
[14:27:36] T316093: Make new Vector search use wbsearchentities on Wikidata - https://phabricator.wikimedia.org/T316093
[14:27:51] !log UTC afternoon backport+config window done
[14:27:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:28:57] (CR) Muehlenhoff: [C: +1] "Looks good!" [puppet] - https://gerrit.wikimedia.org/r/867168 (owner: Volans)
[14:29:40] (CR) Volans: "PCC: https://puppet-compiler.wmflabs.org/output/867169/38693/" [puppet] - https://gerrit.wikimedia.org/r/867169 (https://phabricator.wikimedia.org/T319401) (owner: Volans)
[14:29:59] (CR) JMeybohm: [C: +2] pki: Add intermediates for wikikube and wikikube staging (1/2) [puppet] - https://gerrit.wikimedia.org/r/867155 (owner: JMeybohm)
[14:32:08] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10
[14:33:01] (PS3) Hnowlan: conftool: add kubernetes nodes as thumbor nodes [puppet] - https://gerrit.wikimedia.org/r/866445 (https://phabricator.wikimedia.org/T233196)
[14:33:29] (PS1) Volans: keyholder: adjust comment for cloud_cumin_master [labs/private] - https://gerrit.wikimedia.org/r/867180 (https://phabricator.wikimedia.org/T323483)
[14:33:34] (CR) Alexandros Kosiaris: "Since I am not aware yet of the new approach regarding templates works, adding joe who created it." [deployment-charts] - https://gerrit.wikimedia.org/r/865654 (owner: Awight)
[14:33:52] RECOVERY - IPMI Sensor Status on restbase1018 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[14:34:20] (CR) FNegri: [V: +2 C: +2] keyholder: adjust comment for cloud_cumin_master [labs/private] - https://gerrit.wikimedia.org/r/867180 (https://phabricator.wikimedia.org/T323483) (owner: Volans)
[14:35:43] (CR) Volans: "PCC: https://puppet-compiler.wmflabs.org/output/867170/38694/" [puppet] - https://gerrit.wikimedia.org/r/867170 (https://phabricator.wikimedia.org/T319401) (owner: Volans)
[14:35:57] (CR) Volans: [C: +2] cumin: use ssh config restrict [puppet] - https://gerrit.wikimedia.org/r/867168 (owner: Volans)
[14:38:28] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10
[14:38:51] (CR) Elukey: [C: +2] ml-services: update revertrisk docker image [deployment-charts] - https://gerrit.wikimedia.org/r/867156 (https://phabricator.wikimedia.org/T323023) (owner: AikoChou)
[14:41:12] SRE, Fundraising-Backlog, Traffic-Icebox, fr-donorservices, and 3 others: SSL cert for links.email.wikimedia.org - https://phabricator.wikimedia.org/T188561 (DBu-WMF) Hey. Everyone. Where is this exactly. It would be good to get this click-through tracking done very soon. What would it take to...
[14:43:35] (PS1) PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/866811
[14:44:13] Lucas_WMDE: deploy done?
[14:44:22] yup
[14:44:22] I see the done message now, perfect
[14:44:26] (CR) Krinkle: [C: +2] Add Largest Contentful Paint (LCP) [extensions/NavigationTiming] (wmf/1.40.0-wmf.13) - https://gerrit.wikimedia.org/r/866480 (https://phabricator.wikimedia.org/T319329) (owner: Krinkle)
[14:44:28] (PS1) Effie Mouzeli: mediawiki::mcrouter_wancache: Add mc2047 & mc2048 to memcached cluster [puppet] - https://gerrit.wikimedia.org/r/867181 (https://phabricator.wikimedia.org/T293012)
[14:44:31] Thanks!
[14:45:24] (PS6) JMeybohm: pki: Add intermediates for wikikube and wikikube staging (2/2) [puppet] - https://gerrit.wikimedia.org/r/865591
[14:45:26] (PS10) JMeybohm: k8s: Add support for PKI with k8s >= 1.23 [puppet] - https://gerrit.wikimedia.org/r/865592 (https://phabricator.wikimedia.org/T307943)
[14:45:28] (PS5) JMeybohm: k8s: Remove authz_mode hiera key [puppet] - https://gerrit.wikimedia.org/r/866444 (https://phabricator.wikimedia.org/T307943)
[14:45:30] (PS1) JMeybohm: pki: Add intermediate certifikates for wikikube and wikikube_staging [puppet] - https://gerrit.wikimedia.org/r/867182 (https://phabricator.wikimedia.org/T307943)
[14:46:34] SRE, Thumbor, Thumbor Migration, serviceops, and 2 others: tinyrgb is distributed via puppet - https://phabricator.wikimedia.org/T323775 (hnowlan) In progress→Resolved
[14:46:38] SRE, Thumbor, Thumbor Migration, serviceops, and 2 others: Migrate thumbor to Kubernetes - https://phabricator.wikimedia.org/T233196 (hnowlan)
[14:47:13] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 21320
[14:48:04] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 21320
[14:48:18] (CR) JMeybohm: [C: +2] pki: Add intermediates for wikikube and wikikube staging (2/2) [puppet] - https://gerrit.wikimedia.org/r/865591 (owner: JMeybohm)
[14:49:01] (CR) JMeybohm: [C: +2] pki: Add intermediate certifikates for wikikube and wikikube_staging [puppet] - https://gerrit.wikimedia.org/r/867182 (https://phabricator.wikimedia.org/T307943) (owner: JMeybohm)
[14:50:00] (CR) Effie Mouzeli: [C: +2] mediawiki::mcrouter_wancache: Add mc2049 to memcached cluster [puppet] - https://gerrit.wikimedia.org/r/867173 (https://phabricator.wikimedia.org/T293012) (owner: Effie Mouzeli)
[14:50:43] jayme I think I beat you to it
[14:50:45] too slow
[14:50:48] shall I merge ?
[14:50:48] you did
[14:50:54] go ahead please
[14:51:04] (CR) JHathaway: [C: +2] Add Stephanie Delbecque to the wmf group [puppet] - https://gerrit.wikimedia.org/r/866620 (https://phabricator.wikimedia.org/T324753) (owner: JHathaway)
[14:51:06] only because you said pleae
[14:51:09] please*
[14:51:11] (PS2) JHathaway: Add Stephanie Delbecque to the wmf group [puppet] - https://gerrit.wikimedia.org/r/866620 (https://phabricator.wikimedia.org/T324753)
[14:51:14] (CR) JHathaway: [V: +2] Add Stephanie Delbecque to the wmf group [puppet] - https://gerrit.wikimedia.org/r/866620 (https://phabricator.wikimedia.org/T324753) (owner: JHathaway)
[14:52:13] Lucas_WMDE: No issue when deploying with parsoid canaries? I switched one this morning, just checking I didn't break stuff
[14:52:22] I didn’t notice anything
[14:52:31] I still have the terminal open, if you want the scap output copied
[14:52:49] (CR) JHathaway: [C: +2] Add Kwaku Addo Ofori to ops & wmf [puppet] - https://gerrit.wikimedia.org/r/866649 (owner: JHathaway)
[14:52:58] (PS2) JHathaway: Add Kwaku Addo Ofori to ops & wmf [puppet] - https://gerrit.wikimedia.org/r/866649
[14:53:03] (doesn’t look like there was any output matching “parsoid” or “Parsoid”)
[14:53:05] (CR) JHathaway: [V: +2] Add Kwaku Addo Ofori to ops & wmf [puppet] - https://gerrit.wikimedia.org/r/866649 (owner: JHathaway)
[14:53:17] Lucas_WMDE: how about just parse ?
[14:53:36] no matches even for “ars”
[14:53:40] Meh
[14:53:59] scap output must not reference the backend appserver types probably
[14:54:11] If it didn't break it should be ok :p
[14:54:23] I did see lots of helmfile output that I hadn’t seen before
[14:54:51] but I’m assuming that’s harmless
[14:55:09] yep, none of that is in prod
[14:56:02] (PS5) Aqu: HDFS FSImage is backed up to HDFS on monday [puppet] - https://gerrit.wikimedia.org/r/866650 (https://phabricator.wikimedia.org/T324850)
[14:57:42] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 21804
[14:58:02] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 21804
[14:59:11] (Merged) jenkins-bot: Add Largest Contentful Paint (LCP) [extensions/NavigationTiming] (wmf/1.40.0-wmf.13) - https://gerrit.wikimedia.org/r/866480 (https://phabricator.wikimedia.org/T319329) (owner: Krinkle)
[14:59:38] (CR) JMeybohm: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38696/console" [puppet] - https://gerrit.wikimedia.org/r/866444 (https://phabricator.wikimedia.org/T307943) (owner: JMeybohm)
[14:59:58] SRE, ops-eqiad, DC-Ops, serviceops: hw troubleshooting: PSU failure for restbase1018.eqiad.wmnet - https://phabricator.wikimedia.org/T324572 (Clement_Goubert) Last ipmi-sel log line is: `51 | Dec-12-2022 | 12:59:47 | PS Redundancy | Power Supply | Fully Redundant` Icinga all gree...
[15:00:39] (CR) TrainBranchBot: [C: +2] "Approved by krinkle@deploy1002 using scap backport" [extensions/NavigationTiming] (wmf/1.40.0-wmf.13) - https://gerrit.wikimedia.org/r/866480 (https://phabricator.wikimedia.org/T319329) (owner: Krinkle)
[15:00:53] !log krinkle@deploy1002 Started scap: Backport for [[gerrit:866480|Add Largest Contentful Paint (LCP) (T319329)]]
[15:00:56] T319329: Expand navigation timing metrics to include user experience metrics and modernise navigation timing - https://phabricator.wikimedia.org/T319329
[15:02:31] !log krinkle@deploy1002 krinkle and krinkle: Backport for [[gerrit:866480|Add Largest Contentful Paint (LCP) (T319329)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet
[15:02:49] (CR) JMeybohm: [V: +1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38695/console" [puppet] - https://gerrit.wikimedia.org/r/865592 (https://phabricator.wikimedia.org/T307943) (owner: JMeybohm)
[15:05:59] (PS1) Hnowlan: thumbor: fix metric labels [deployment-charts] - https://gerrit.wikimedia.org/r/867186 (https://phabricator.wikimedia.org/T233196)
[15:06:11] (PS18) Hashar: Replace CI results table by Gerrit Check API [software/gerrit] (deploy/wmf/stable-3.5) - https://gerrit.wikimedia.org/r/859083 (https://phabricator.wikimedia.org/T214068)
[15:06:13] (PS8) Hashar: Add unit testing with QUnit [software/gerrit] (deploy/wmf/stable-3.5) - https://gerrit.wikimedia.org/r/861486
[15:06:45] !log aikochou@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' .
[15:09:44] !log krinkle@deploy1002 Finished scap: Backport for [[gerrit:866480|Add Largest Contentful Paint (LCP) (T319329)]] (duration: 08m 51s)
[15:09:48] T319329: Expand navigation timing metrics to include user experience metrics and modernise navigation timing - https://phabricator.wikimedia.org/T319329
[15:10:41] (CR) Clément Goubert: [C: +1] thumbor: fix metric labels [deployment-charts] - https://gerrit.wikimedia.org/r/867186 (https://phabricator.wikimedia.org/T233196) (owner: Hnowlan)
[15:15:12] (CR) Hnowlan: [C: +2] thumbor: fix metric labels [deployment-charts] - https://gerrit.wikimedia.org/r/867186 (https://phabricator.wikimedia.org/T233196) (owner: Hnowlan)
[15:16:03] (PS6) Aqu: Backing up HDFS FSImage to HDFS on Monday morning [puppet] - https://gerrit.wikimedia.org/r/866650 (https://phabricator.wikimedia.org/T324850)
[15:16:11] (CR) Arturo Borrero Gonzalez: base::cloud_production: introduce new profile (2 comments) [puppet] - https://gerrit.wikimedia.org/r/867169 (https://phabricator.wikimedia.org/T319401) (owner: Volans)
[15:16:40] (CR) Hashar: "Kosta reported https://gerrit.wikimedia.org/r/c/mediawiki/extensions/PageTriage/+/865596/ was showing some errors even though everything p" [software/gerrit] (deploy/wmf/stable-3.5) - https://gerrit.wikimedia.org/r/859083 (https://phabricator.wikimedia.org/T214068) (owner: Hashar)
[15:17:15] (CR) Arturo Borrero Gonzalez: cumin::cloud_target: add a new profile (1 comment) [puppet] - https://gerrit.wikimedia.org/r/867170 (https://phabricator.wikimedia.org/T319401) (owner: Volans)
[15:18:40] (CR) Hashar: Add unit testing with QUnit (1 comment) [software/gerrit] (deploy/wmf/stable-3.5) - https://gerrit.wikimedia.org/r/861486 (owner: Hashar)
[15:20:16] (Merged) jenkins-bot: thumbor: fix metric labels [deployment-charts] - https://gerrit.wikimedia.org/r/867186 (https://phabricator.wikimedia.org/T233196) (owner: Hnowlan)
[15:21:31] (PS1) Effie Mouzeli: mediawiki::mcrouter_wancache: Add mc2049 to memcached cluster (fix) [puppet] - https://gerrit.wikimedia.org/r/867187
[15:23:09] (CR) Effie Mouzeli: [C: +2] mediawiki::mcrouter_wancache: Add mc2049 to memcached cluster (fix) [puppet] - https://gerrit.wikimedia.org/r/867187 (owner: Effie Mouzeli)
[15:24:10] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: apply
[15:27:05] (CR) Andrew Bogott: [C: +2] rsyslog: add support for openssl netstream driver [puppet] - https://gerrit.wikimedia.org/r/865731 (https://phabricator.wikimedia.org/T324623) (owner: Southparkfan)
[15:33:36] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/thumbor: sync
[15:34:15] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: apply
[15:35:49] SRE-tools, Infrastructure-Foundations, Patch-For-Review, cloud-services-team (Kanban): WMCS Cookbook Automation Q2 tracking task - https://phabricator.wikimedia.org/T319401 (fnegri)
[15:36:37] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: sync
[15:37:01] (PS13) Arturo Borrero Gonzalez: cloudlb: introduce role skeleton [puppet] - https://gerrit.wikimedia.org/r/861902 (https://phabricator.wikimedia.org/T297596)
[15:38:26] (CR) Arturo Borrero Gonzalez: [C: +2] P:openstack::designate: update firewall rules for haproxy_nodes [puppet] - https://gerrit.wikimedia.org/r/863843 (owner: Majavah)
[15:39:03] (CR) Jbond: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38701/console" [puppet] - https://gerrit.wikimedia.org/r/867169 (https://phabricator.wikimedia.org/T319401) (owner: Volans)
[15:40:38] (CR) Arturo Borrero Gonzalez: [C: +2] P:openstack::keystone: add explicit firewall rules for haproxy_nodes [puppet] - https://gerrit.wikimedia.org/r/863844 (owner: Majavah)
[15:41:30] (CR) Arturo Borrero Gonzalez: [C: +2] P:openstack::glance: add explicit firewall rules for haproxy_nodes [puppet] - https://gerrit.wikimedia.org/r/863845 (owner: Majavah)
[15:42:49] (PS2) Effie Mouzeli: mediawiki::mcrouter_wancache: Add mc2047 & mc2048 to memcached cluster [puppet] - https://gerrit.wikimedia.org/r/867181 (https://phabricator.wikimedia.org/T293012)
[15:43:10] (CR) Arturo Borrero Gonzalez: [C: +2] P:openstack::cinder: add explicit firewall rules for haproxy_nodes [puppet] - https://gerrit.wikimedia.org/r/863846 (owner: Majavah)
[15:43:35] (PS3) Effie Mouzeli: mediawiki::mcrouter_wancache: Add mc2047 & mc2048 to memcached cluster [puppet] - https://gerrit.wikimedia.org/r/867181 (https://phabricator.wikimedia.org/T293012)
[15:43:44] (CR) Arturo Borrero Gonzalez: [C: +2] P:openstack::trove: add explicit firewall rules for haproxy_nodes [puppet] - https://gerrit.wikimedia.org/r/863847 (owner: Majavah)
[15:43:48] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/thumbor: sync
[15:44:14] (CR) Alexandros Kosiaris: [C: +1] k8s: Remove authz_mode hiera key [puppet] - https://gerrit.wikimedia.org/r/866444 (https://phabricator.wikimedia.org/T307943) (owner: JMeybohm)
[15:44:48] (PS15) Andrew Bogott: rsyslog: allow specifying a hiera-defined certfile [puppet] - https://gerrit.wikimedia.org/r/865174 (https://phabricator.wikimedia.org/T127717)
[15:44:50] (PS9) Andrew Bogott: remote syslog: allow hiera config of rsyslog TLS CA [puppet] - https://gerrit.wikimedia.org/r/865184 (https://phabricator.wikimedia.org/T127717)
[15:44:52] (PS2) Andrew Bogott: Turn on central auth logging for all eqiad1 VMs [puppet] - https://gerrit.wikimedia.org/r/866628 (https://phabricator.wikimedia.org/T127717)
[15:45:02] (CR) Arturo Borrero Gonzalez: [C: +2] P:openstack::radosgw: add explicit firewall rules for haproxy_nodes [puppet] - https://gerrit.wikimedia.org/r/863848 (owner: Majavah)
[15:46:41] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: sync
[15:47:08] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/thumbor: sync
[15:47:33] (CR) Arturo Borrero Gonzalez: [C: +2] P:openstack::barbican: add explicit firewall rules for haproxy_nodes [puppet] - https://gerrit.wikimedia.org/r/863849 (owner: Majavah)
[15:48:18] (CR) FNegri: base::cloud_production: introduce new profile (1 comment) [puppet] - https://gerrit.wikimedia.org/r/867169 (https://phabricator.wikimedia.org/T319401) (owner: Volans)
[15:49:24] (CR) Arturo Borrero Gonzalez: [C: +2] P:openstack::heat: add explicit firewall rules for haproxy_nodes [puppet] - https://gerrit.wikimedia.org/r/863850 (owner: Majavah)
[15:49:51] (CR) Jbond: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38703/console" [puppet] - https://gerrit.wikimedia.org/r/867165 (owner: Jbond)
[15:50:36] (PS2) Jbond: systemd::timer: only validate intervals if the timer is present [puppet] - https://gerrit.wikimedia.org/r/867165
[15:51:00] (CR) Arturo Borrero Gonzalez: [C: +2] P:openstack::magnum: add explicit firewall rules for haproxy_nodes [puppet] - https://gerrit.wikimedia.org/r/863851 (owner: Majavah)
[15:51:24] (CR) Arturo Borrero Gonzalez: [C: +2] P:openstack::neutron: add explicit firewall rules for haproxy_nodes [puppet] - https://gerrit.wikimedia.org/r/863852 (owner: Majavah)
[15:51:56] (CR) Arturo Borrero Gonzalez: [C: +2] P:openstack::nova: metadata: add explicit firewall rules for haproxy_nodes [puppet] - https://gerrit.wikimedia.org/r/863853 (owner: Majavah)
[15:52:14] (CR) Jbond: [V: +1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38705/console" [puppet] - https://gerrit.wikimedia.org/r/867165 (owner: Jbond)
[15:52:30] (PS1) Hashar: posgresql: properly confine fact [puppet] - https://gerrit.wikimedia.org/r/867197 (https://phabricator.wikimedia.org/T324571)
[15:52:45] (CR) Jbond: [V: +1 C: +2] systemd::timer: only validate intervals if the timer is present [puppet] - https://gerrit.wikimedia.org/r/867165 (owner: Jbond)
[15:53:13] (CR) Arturo Borrero Gonzalez: P:openstack::placement: add explicit firewall rules for haproxy_nodes (1 comment) [puppet] - https://gerrit.wikimedia.org/r/863854 (owner: Majavah)
[15:53:14] arturo: happy for me to merge your changes
[15:53:51] (CR) Arturo Borrero Gonzalez: [C: +2] P:openstack::galera: add explicit firewall rules for haproxy_nodes [puppet] - https://gerrit.wikimedia.org/r/863855 (owner: Majavah)
[15:54:21] (PS4) Arturo Borrero Gonzalez: P:openstack::galera: add explicit firewall rules for haproxy_nodes [puppet] - https://gerrit.wikimedia.org/r/863855 (owner: Majavah)
[15:54:37] (CR) Hashar: "I think the issue comes from Iede3d0263c0c8abe5000dc16f96781d58406c1b5" [puppet] - https://gerrit.wikimedia.org/r/867197 (https://phabricator.wikimedia.org/T324571) (owner: Hashar)
[15:55:40] (PS2) Muehlenhoff: Set role_contacts for apifeatureusage::logstash [puppet] - https://gerrit.wikimedia.org/r/863329
[15:55:57] (CR) Cwhite: [C: +1] "PCC checks out: https://puppet-compiler.wmflabs.org/output/865174/38704/" [puppet] - https://gerrit.wikimedia.org/r/865174 (https://phabricator.wikimedia.org/T127717) (owner: Andrew Bogott)
[15:57:22] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/thumbor: sync
[15:57:32] !log denisse@cumin1001 START - Cookbook sre.hosts.decommission for hosts netmon2001.wikimedia.org
[15:57:55] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: sync
[15:58:13] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/thumbor: sync
[15:59:22] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/thumbor: sync
[16:01:12] (CR) Jbond: [V: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/867169 (https://phabricator.wikimedia.org/T319401) (owner: Volans)
[16:01:52] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/thumbor: sync
[16:01:57] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/thumbor: sync
[16:01:57] (CR) Arturo Borrero Gonzalez: base::cloud_production: introduce new profile (1 comment) [puppet] - https://gerrit.wikimedia.org/r/867169 (https://phabricator.wikimedia.org/T319401) (owner: Volans)
[16:02:12] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/thumbor: sync
[16:02:54] !log denisse@cumin1001 START - Cookbook sre.dns.netbox
[16:04:01] !log denisse@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:04:02] !log denisse@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts netmon2001.wikimedia.org
[16:05:11] (CR) Jbond: [C: +2] "thanks hashar lgtm will merge" [puppet] - https://gerrit.wikimedia.org/r/867197 (https://phabricator.wikimedia.org/T324571) (owner: Hashar)
[16:06:46] (JobUnavailable) firing: (2) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:08:00] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: sync
[16:12:39] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/thumbor: sync
[16:15:04] (PS1) JHathaway: Add Muhammad Jaziraly to wmde and nda [puppet] - https://gerrit.wikimedia.org/r/867200 (https://phabricator.wikimedia.org/T324477)
[16:15:36] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/thumbor: sync
[16:16:47] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/thumbor: sync
[16:17:06] SRE, LDAP-Access-Requests, Patch-For-Review: Grant Access to wmde for Muhammad Jaziraly - https://phabricator.wikimedia.org/T324477 (jhathaway) Open→In progress @MoritzMuehlenhoff thanks for the reminder, patch cut!
[16:17:55] (CR) Majavah: P:openstack::placement: add explicit firewall rules for haproxy_nodes (1 comment) [puppet] - https://gerrit.wikimedia.org/r/863854 (owner: Majavah)
[16:18:15] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be1004.eqiad.wmnet with OS bullseye
[16:18:21] SRE-swift-storage: Run Thanos backend on Bullseye - https://phabricator.wikimedia.org/T288937 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host thanos-be1004.eqiad.wmnet with OS bullseye completed: - thanos-be1004 (**PASS**) - Downtimed on Icinga/Alertmanager...
[16:19:44] (CR) FNegri: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38708/console" [puppet] - https://gerrit.wikimedia.org/r/867169 (https://phabricator.wikimedia.org/T319401) (owner: Volans)
[16:21:00] (Abandoned) Jforrester: wikifeeds: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/858371 (owner: PipelineBot)
[16:21:04] (Abandoned) Jforrester: wikifeeds: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/860591 (owner: PipelineBot)
[16:21:14] (Abandoned) Jforrester: apple-search: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/840568 (owner: PipelineBot)
[16:21:46] (JobUnavailable) firing: (2) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:23:03] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[16:24:21] (PS1) Daniel Kinzler: Disable writing parsoid html to PC on commons and wikidata. [mediawiki-config] - https://gerrit.wikimedia.org/r/867202
[16:24:34] Amir1: --^
[16:24:55] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST
[16:25:05] (CR) CI reject: [V: -1] Disable writing parsoid html to PC on commons and wikidata. [mediawiki-config] - https://gerrit.wikimedia.org/r/867202 (owner: Daniel Kinzler)
[16:26:10] I will deploy it after my meeting
[16:26:19] (PS2) Daniel Kinzler: Disable writing parsoid html to PC on commons and wikidata. [mediawiki-config] - https://gerrit.wikimedia.org/r/867202
[16:28:15] (CR) FNegri: cumin::cloud_target: add a new profile (1 comment) [puppet] - https://gerrit.wikimedia.org/r/867170 (https://phabricator.wikimedia.org/T319401) (owner: Volans)
[16:29:55] (PS2) Volans: base::cloud_production: introduce new profile [puppet] - https://gerrit.wikimedia.org/r/867169 (https://phabricator.wikimedia.org/T319401)
[16:29:57] (PS2) Volans: cumin::cloud_target: add a new profile [puppet] - https://gerrit.wikimedia.org/r/867170 (https://phabricator.wikimedia.org/T319401)
[16:30:04] jan_drewniak: (Dis)respected human, time to deploy Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221212T1630). Please do the needful.
[16:31:57] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] P:openstack::placement: add explicit firewall rules for haproxy_nodes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/863854 (owner: 10Majavah) [16:32:44] (03PS14) 10Arturo Borrero Gonzalez: cloudlb: introduce role skeleton [puppet] - 10https://gerrit.wikimedia.org/r/861902 (https://phabricator.wikimedia.org/T297596) [16:32:55] (03PS15) 10Arturo Borrero Gonzalez: cloudlb: introduce role skeleton [puppet] - 10https://gerrit.wikimedia.org/r/861902 (https://phabricator.wikimedia.org/T297596) [16:33:01] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/867200 (https://phabricator.wikimedia.org/T324477) (owner: 10JHathaway) [16:34:53] (03PS7) 10Aqu: Backing up HDFS FSImage to HDFS on Monday morning [puppet] - 10https://gerrit.wikimedia.org/r/866650 (https://phabricator.wikimedia.org/T324850) [16:36:19] (03PS16) 10Arturo Borrero Gonzalez: cloudlb: introduce role skeleton [puppet] - 10https://gerrit.wikimedia.org/r/861902 (https://phabricator.wikimedia.org/T297596) [16:36:44] (03CR) 10Andrew Bogott: [C: 03+2] rsyslog: allow specifying a hiera-defined certfile [puppet] - 10https://gerrit.wikimedia.org/r/865174 (https://phabricator.wikimedia.org/T127717) (owner: 10Andrew Bogott) [16:36:48] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on cloudcumin2001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [16:36:57] (03PS1) 10Jcrespo: Ignore the backup check of contint1001 [puppet] - 10https://gerrit.wikimedia.org/r/867206 (https://phabricator.wikimedia.org/T324698) [16:37:12] (03CR) 10CI reject: [V: 04-1] cloudlb: introduce role skeleton [puppet] - 10https://gerrit.wikimedia.org/r/861902 (https://phabricator.wikimedia.org/T297596) (owner: 10Arturo Borrero Gonzalez) [16:39:15] (03CR) 10Jcrespo: "As suggested on ticket. 
The other option (other than disabling backups) is to rerun a full backup (which may be wanted as a last run befor" [puppet] - 10https://gerrit.wikimedia.org/r/867206 (https://phabricator.wikimedia.org/T324698) (owner: 10Jcrespo) [16:39:38] (03PS2) 10Jcrespo: bacula: Ignore the backup check of contint1001 jobs [puppet] - 10https://gerrit.wikimedia.org/r/867206 (https://phabricator.wikimedia.org/T324698) [16:39:42] (03PS17) 10Arturo Borrero Gonzalez: cloudlb: introduce role skeleton [puppet] - 10https://gerrit.wikimedia.org/r/861902 (https://phabricator.wikimedia.org/T297596) [16:40:31] RECOVERY - ElasticSearch health check for shards on 9400 on relforge1004 is OK: OK - elasticsearch status relforge-eqiad-small-alpha: cluster_name: relforge-eqiad-small-alpha, status: green, timed_out: False, number_of_nodes: 2, number_of_data_nodes: 2, active_primary_shards: 5, active_shards: 10, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flig [16:40:31] : 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [16:40:47] RECOVERY - ElasticSearch health check for shards on 9400 on relforge1003 is OK: OK - elasticsearch status relforge-eqiad-small-alpha: cluster_name: relforge-eqiad-small-alpha, status: green, timed_out: False, number_of_nodes: 2, number_of_data_nodes: 2, active_primary_shards: 5, active_shards: 10, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flig [16:40:47] : 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [16:46:50] (03CR) 10Andrew Bogott: "Seems to still work after the rebase" [puppet] - 10https://gerrit.wikimedia.org/r/865184 (https://phabricator.wikimedia.org/T127717) (owner: 10Andrew Bogott) [16:48:02] 
(PS18) Arturo Borrero Gonzalez: cloudlb: introduce role skeleton [puppet] - https://gerrit.wikimedia.org/r/861902 (https://phabricator.wikimedia.org/T297596) [16:48:22] (CR) CI reject: [V: -1] cloudlb: introduce role skeleton [puppet] - https://gerrit.wikimedia.org/r/861902 (https://phabricator.wikimedia.org/T297596) (owner: Arturo Borrero Gonzalez) [16:48:54] (CR) Effie Mouzeli: [C: +2] mediawiki::mcrouter_wancache: Add mc2047 & mc2048 to memcached cluster [puppet] - https://gerrit.wikimedia.org/r/867181 (https://phabricator.wikimedia.org/T293012) (owner: Effie Mouzeli) [16:49:43] (CR) Andrew Bogott: [C: +2] remote syslog: allow hiera config of rsyslog TLS CA [puppet] - https://gerrit.wikimedia.org/r/865184 (https://phabricator.wikimedia.org/T127717) (owner: Andrew Bogott) [16:52:16] (Abandoned) David Caro: acme_chief::server: remove sre-traffic email from timer [puppet] - https://gerrit.wikimedia.org/r/788312 (owner: David Caro) [16:53:06] (CR) JHathaway: [C: +2] Add Muhammad Jaziraly to wmde and nda [puppet] - https://gerrit.wikimedia.org/r/867200 (https://phabricator.wikimedia.org/T324477) (owner: JHathaway) [16:53:46] (CR) Ottomata: "Couple more nits! Other than that LGTM though we can merge after those."
[puppet] - https://gerrit.wikimedia.org/r/866650 (https://phabricator.wikimedia.org/T324850) (owner: Aqu) [16:53:54] SRE, LDAP-Access-Requests, Patch-For-Review: Grant Access to wmde for Muhammad Jaziraly - https://phabricator.wikimedia.org/T324477 (jhathaway) In progress→Resolved [16:55:49] SRE, Continuous-Integration-Infrastructure, serviceops-collab: contint2002 service implementation tracking - https://phabricator.wikimedia.org/T324659 (LSobanski) p:Triage→Medium [16:56:10] (PS19) Arturo Borrero Gonzalez: cloudlb: introduce role skeleton [puppet] - https://gerrit.wikimedia.org/r/861902 (https://phabricator.wikimedia.org/T297596) [16:57:10] (PS3) Andrew Bogott: Turn on central auth logging for all eqiad1 VMs [puppet] - https://gerrit.wikimedia.org/r/866628 (https://phabricator.wikimedia.org/T127717) [16:59:37] (PS20) Arturo Borrero Gonzalez: cloudlb: introduce role skeleton [puppet] - https://gerrit.wikimedia.org/r/861902 (https://phabricator.wikimedia.org/T297596) [16:59:40] ops-eqiad, DC-Ops, Discovery-Search (Current work): Investigate disk errors on wcqs1003.eqiad.wmnet - https://phabricator.wikimedia.org/T323380 (RobH) a:Jclark-ctr @bking pinged in DC-ops channel asking about this: > Hey DC Ops, this isn't urgent but wondering if we need to add some tags or so...
[17:01:42] (CR) Arturo Borrero Gonzalez: cloudlb: introduce role skeleton (1 comment) [puppet] - https://gerrit.wikimedia.org/r/861902 (https://phabricator.wikimedia.org/T297596) (owner: Arturo Borrero Gonzalez) [17:11:15] (CR) David Caro: [C: -1] wmcs: changes to api service to manage toolforge replica.my.cnf (1 comment) [puppet] - https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe) [17:13:14] (CR) FNegri: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38714/console" [puppet] - https://gerrit.wikimedia.org/r/866598 (https://phabricator.wikimedia.org/T324846) (owner: FNegri) [17:16:46] SRE, SRE-Access-Requests, Data-Engineering-Planning, WMF-Communications: LDAP access for Sondes to access Matomo - https://phabricator.wikimedia.org/T324696 (jhathaway) @Varnent apologies for missing this last week, it wasn't on the board we usually work off of and I missed the phabricator ping....
[17:19:56] (PS1) Jaime Nuche: mwdebug_deploy: remove resources from deployment server [puppet] - https://gerrit.wikimedia.org/r/867217 [17:20:25] ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q3:rack/setup/install cloudcephosd10(3[5-9]|40) - https://phabricator.wikimedia.org/T324998 (RobH) [17:20:33] ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q3:rack/setup/install cloudcephosd10(3[5-9]|40) - https://phabricator.wikimedia.org/T324998 (RobH) [17:20:51] ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q3:rack/setup/install cloudcephosd10(3[5-9]|40) - https://phabricator.wikimedia.org/T324998 (RobH) [17:24:54] (CR) Southparkfan: [C: +1] Turn on central auth logging for all eqiad1 VMs [puppet] - https://gerrit.wikimedia.org/r/866628 (https://phabricator.wikimedia.org/T127717) (owner: Andrew Bogott) [17:25:02] (PS1) Jaime Nuche: mwdebug_deploy: remove configuration [puppet] - https://gerrit.wikimedia.org/r/867221 [17:25:37] (CR) Michael Große: "So, `child-src` not only applies to