[00:46:28] RECOVERY - cassandra-a service on restbase2011 is OK: OK - cassandra-a is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [00:53:38] PROBLEM - cassandra-a service on restbase2011 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [01:15:50] RECOVERY - cassandra-b service on restbase2011 is OK: OK - cassandra-b is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [01:22:48] PROBLEM - cassandra-b service on restbase2011 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [01:37:28] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={gitaly,sidekiq} site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [01:39:44] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [01:46:18] PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs1013 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [01:48:32] RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs1013 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [01:56:36] PROBLEM - WDQS high update lag on wdqs1013 is CRITICAL: 6.968e+07 ge 4.32e+07 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [02:16:20] RECOVERY - cassandra-a service on restbase2011 is OK: OK - cassandra-a is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [02:23:12] PROBLEM - cassandra-a service on restbase2011 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [02:49:46] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:50:52] RECOVERY - WDQS high update lag on wdqs1013 is OK: (C)4.32e+07 ge (W)2.16e+07 ge 2.088e+07 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [02:55:40] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [03:44:32] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:52:32] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin1001 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [05:20:36] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /_info (retrieve service info) is CRITICAL: Test retrieve service info returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid [05:23:00] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [05:42:05] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance [05:42:07] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance [05:42:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:42:09] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1116.eqiad.wmnet with reason: Maintenance [05:42:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:42:11] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1116.eqiad.wmnet with reason: Maintenance [05:42:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:42:13] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1104.eqiad.wmnet with reason: Maintenance [05:42:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:42:14] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1104.eqiad.wmnet with reason: Maintenance [05:42:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:42:18] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:42:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1104 (T285149)', diff saved to https://phabricator.wikimedia.org/P18977 and previous config saved to /var/cache/conftool/dbconfig/20220124-054218-marostegui.json [05:42:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:42:23] T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149 [05:43:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es1029 for reimage T299741', diff saved to https://phabricator.wikimedia.org/P18978 and previous config saved to /var/cache/conftool/dbconfig/20220124-054349-marostegui.json [05:43:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:43:53] T299741: Upgrade es1 to Bullseye - https://phabricator.wikimedia.org/T299741 [05:44:45] (PS1) Marostegui: es1029: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/756192 (https://phabricator.wikimedia.org/T299741) [05:45:50] (CR) Marostegui: [C: +2] es1029: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/756192 (https://phabricator.wikimedia.org/T299741) (owner: Marostegui) [05:49:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1104 (T285149)', diff saved to https://phabricator.wikimedia.org/P18979 and previous config saved to /var/cache/conftool/dbconfig/20220124-054926-marostegui.json [05:49:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:49:31] T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149 [05:52:10] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host es1029.eqiad.wmnet with OS bullseye [05:52:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:58:51] (PS1) Marostegui: add_gb_by_central_id_T299827.py: New schema change 
[software/schema-changes] - https://gerrit.wikimedia.org/r/756346 (https://phabricator.wikimedia.org/T299827) [06:02:03] (PS1) Marostegui: es1022: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/756411 [06:02:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es1022 T299123', diff saved to https://phabricator.wikimedia.org/P18980 and previous config saved to /var/cache/conftool/dbconfig/20220124-060248-marostegui.json [06:02:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:02:53] T299123: es1022 troubles with PXE - https://phabricator.wikimedia.org/T299123 [06:04:03] (CR) Marostegui: [C: +2] es1022: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/756411 (owner: Marostegui) [06:04:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1104', diff saved to https://phabricator.wikimedia.org/P18981 and previous config saved to /var/cache/conftool/dbconfig/20220124-060431-marostegui.json [06:04:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:05:07] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host es1022.eqiad.wmnet with OS bullseye [06:05:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:05:14] SRE, ops-eqiad, DBA: es1022 troubles with PXE - https://phabricator.wikimedia.org/T299123 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1001 for host es1022.eqiad.wmnet with OS bullseye [06:19:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1104', diff saved to https://phabricator.wikimedia.org/P18982 and previous config saved to /var/cache/conftool/dbconfig/20220124-061936-marostegui.json [06:19:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:23:34] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
es1029.eqiad.wmnet with OS bullseye [06:23:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:33:26] PROBLEM - Check systemd state on cumin2001 is CRITICAL: CRITICAL - degraded: The following units failed: database-backups-snapshots.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:34:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1104 (T285149)', diff saved to https://phabricator.wikimedia.org/P18983 and previous config saved to /var/cache/conftool/dbconfig/20220124-063440-marostegui.json [06:34:42] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1178.eqiad.wmnet with reason: Maintenance [06:34:44] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1178.eqiad.wmnet with reason: Maintenance [06:34:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:34:45] T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149 [06:34:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:34:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1178 (T285149)', diff saved to https://phabricator.wikimedia.org/P18984 and previous config saved to /var/cache/conftool/dbconfig/20220124-063448-marostegui.json [06:34:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:34:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:44:45] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es1022.eqiad.wmnet with OS bullseye [06:44:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:44:52] SRE, ops-eqiad, DBA: es1022 troubles with PXE - https://phabricator.wikimedia.org/T299123 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1001 for host 
es1022.eqiad.wmnet with OS bullseye completed: - es1022 (**WARN**) - Downtimed on Icinga - Disabled Puppet... [06:50:23] (CR) Kevin Bazira: [C: +1] "LGTM" [deployment-charts] - https://gerrit.wikimedia.org/r/756064 (https://phabricator.wikimedia.org/T298989) (owner: Accraze) [07:00:52] (CR) Legoktm: [C: +1] "LGTM, will leave open for a few days in case anyone else has comments." [puppet] - https://gerrit.wikimedia.org/r/756069 (https://phabricator.wikimedia.org/T299823) (owner: MarcoAurelio) [07:04:10] (CR) Legoktm: [V: +1 C: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33388/console" [puppet] - https://gerrit.wikimedia.org/r/756069 (https://phabricator.wikimedia.org/T299823) (owner: MarcoAurelio) [07:05:57] (PS10) Legoktm: mediawiki::maintenance: Run recountCategories.php monthly on all wikis [puppet] - https://gerrit.wikimedia.org/r/756069 (https://phabricator.wikimedia.org/T299823) (owner: MarcoAurelio) [07:08:19] (PS11) Legoktm: mediawiki::maintenance: Run recountCategories.php monthly on all wikis [puppet] - https://gerrit.wikimedia.org/r/756069 (https://phabricator.wikimedia.org/T299823) (owner: MarcoAurelio) [07:09:30] (CR) Legoktm: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33390/console" [puppet] - https://gerrit.wikimedia.org/r/756069 (https://phabricator.wikimedia.org/T299823) (owner: MarcoAurelio) [07:10:47] (CR) Legoktm: [V: +1 C: +1] "PCC lgtm." [puppet] - https://gerrit.wikimedia.org/r/756069 (https://phabricator.wikimedia.org/T299823) (owner: MarcoAurelio) [07:16:10] SRE, ops-eqiad, DBA: es1022 troubles with PXE - https://phabricator.wikimedia.org/T299123 (Marostegui) a:Marostegui→Cmjohnson @Cmjohnson the host keeps ignoring PXE boot even if it attempts to do so from the boot menu. Not sure what could be root cause for this. 
It only works if selected manu... [07:16:38] SRE, Data-Engineering: Allow kafka brokers to reload the TLS keystore - https://phabricator.wikimedia.org/T299409 (elukey) Tried to reload the keystore on a couple of test brokers since the first warnings for tls cert expiry came up in icinga, but it doesn't seem to work. On the server.log I see stuff li... [07:16:42] RECOVERY - cassandra-b service on restbase2011 is OK: OK - cassandra-b is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [07:20:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1029 (re)pooling @ 1%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P18985 and previous config saved to /var/cache/conftool/dbconfig/20220124-072035-root.json [07:20:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:22:39] (PS1) Marostegui: Revert "es1029: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/756078 [07:23:18] PROBLEM - cassandra-b service on restbase2011 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [07:23:52] (CR) Marostegui: [C: +2] Revert "es1029: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/756078 (owner: Marostegui) [07:35:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T285149)', diff saved to https://phabricator.wikimedia.org/P18986 and previous config saved to /var/cache/conftool/dbconfig/20220124-073507-marostegui.json [07:35:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:35:12] T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149 [07:35:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1029 (re)pooling @ 5%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P18987 and previous config saved to 
/var/cache/conftool/dbconfig/20220124-073539-root.json [07:35:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:50:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P18988 and previous config saved to /var/cache/conftool/dbconfig/20220124-075012-marostegui.json [07:50:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:50:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1029 (re)pooling @ 10%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P18989 and previous config saved to /var/cache/conftool/dbconfig/20220124-075043-root.json [07:50:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:54:42] (PS1) Marostegui: Revert "es1022: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/756079 [07:55:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 5%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P18990 and previous config saved to /var/cache/conftool/dbconfig/20220124-075536-root.json [07:55:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:55:56] (CR) Marostegui: [C: +2] Revert "es1022: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/756079 (owner: Marostegui) [07:58:16] (PS1) Elukey: Add a kafka_11 profile to the PKI Kafka Intermediate settings [puppet] - https://gerrit.wikimedia.org/r/756522 (https://phabricator.wikimedia.org/T299409) [08:00:40] (PS2) Elukey: Add a kafka_11 profile to the PKI Kafka Intermediate settings [puppet] - https://gerrit.wikimedia.org/r/756522 (https://phabricator.wikimedia.org/T299409) [08:02:02] (PS2) Muehlenhoff: Make ganeti1026 a Ganeti node [puppet] - https://gerrit.wikimedia.org/r/755975 [08:03:34] (CR) Elukey: "The pcc run is not working because I haven't modified the 
private fake puppet repo. If the change is ok for everybody I'll do it and run p" [puppet] - https://gerrit.wikimedia.org/r/756522 (https://phabricator.wikimedia.org/T299409) (owner: Elukey) [08:05:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P18991 and previous config saved to /var/cache/conftool/dbconfig/20220124-080517-marostegui.json [08:05:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:05:25] (CR) Muehlenhoff: [C: +2] Make ganeti1026 a Ganeti node [puppet] - https://gerrit.wikimedia.org/r/755975 (owner: Muehlenhoff) [08:05:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1029 (re)pooling @ 20%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P18992 and previous config saved to /var/cache/conftool/dbconfig/20220124-080546-root.json [08:05:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:10:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 10%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P18993 and previous config saved to /var/cache/conftool/dbconfig/20220124-081040-root.json [08:10:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:11:38] SRE-tools, Dumps-Generation, Infrastructure-Foundations, serviceops, IPv6: Some Service Operations clusters apparently do not support IPv6 - https://phabricator.wikimedia.org/T271142 (ArielGlenn) This task seems to have stalled after crusnov's departure; is someone else expecting to pick it u... 
[08:12:20] RECOVERY - Check systemd state on zookeeper-test1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:15:34] !log draining instances off ganeti1014 for reimage [08:15:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:16:24] RECOVERY - cassandra-a service on restbase2011 is OK: OK - cassandra-a is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [08:18:50] (PS1) Elukey: api-gateway: allow TLS conns to PKI based TLS backends [deployment-charts] - https://gerrit.wikimedia.org/r/756524 (https://phabricator.wikimedia.org/T299550) [08:19:52] PROBLEM - SSH on mw2258.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:20:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T285149)', diff saved to https://phabricator.wikimedia.org/P18994 and previous config saved to /var/cache/conftool/dbconfig/20220124-082022-marostegui.json [08:20:23] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1099.eqiad.wmnet with reason: Maintenance [08:20:25] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1099.eqiad.wmnet with reason: Maintenance [08:20:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:20:27] T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149 [08:20:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:20:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1099:3318 (T285149)', diff saved to https://phabricator.wikimedia.org/P18995 and previous config saved to /var/cache/conftool/dbconfig/20220124-082029-marostegui.json [08:20:30] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:20:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:20:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1029 (re)pooling @ 25%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P18996 and previous config saved to /var/cache/conftool/dbconfig/20220124-082050-root.json [08:20:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:21:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3318 (T285149)', diff saved to https://phabricator.wikimedia.org/P18997 and previous config saved to /var/cache/conftool/dbconfig/20220124-082135-marostegui.json [08:21:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:21:54] PROBLEM - cassandra-a service on restbase2011 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [08:23:33] RECOVERY - Check systemd state on cumin2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:25:07] (CR) JMeybohm: Add basic ingress support to chart common_templates (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/732374 (https://phabricator.wikimedia.org/T290966) (owner: JMeybohm) [08:25:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 20%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P18998 and previous config saved to /var/cache/conftool/dbconfig/20220124-082543-root.json [08:25:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:26:29] (PS2) Giuseppe Lavagetto: _tls_helpers: fail if a listener is non existent [deployment-charts] - https://gerrit.wikimedia.org/r/755527 (https://phabricator.wikimedia.org/T291959) [08:29:53] (CR) JMeybohm: [C: +1] Add a 
kafka_11 profile to the PKI Kafka Intermediate settings (1 comment) [puppet] - https://gerrit.wikimedia.org/r/756522 (https://phabricator.wikimedia.org/T299409) (owner: Elukey) [08:35:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1029 (re)pooling @ 40%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P18999 and previous config saved to /var/cache/conftool/dbconfig/20220124-083554-root.json [08:35:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:36:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3318', diff saved to https://phabricator.wikimedia.org/P19000 and previous config saved to /var/cache/conftool/dbconfig/20220124-083640-marostegui.json [08:36:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:36:47] (PS3) Juan90264: Disable RelatedArticles on ptwikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/756081 (https://phabricator.wikimedia.org/T299873) [08:38:36] (CR) Filippo Giunchedi: [C: +1] elasticsearch: write curator logs to stdout [puppet] - https://gerrit.wikimedia.org/r/756053 (https://phabricator.wikimedia.org/T297239) (owner: Cwhite) [08:40:06] (PS1) 4nn1l2: commonswiki: Change data.nhm.ac.uk to *.nhm.ac.uk in the wgCopyUploadsDomains allowlist [mediawiki-config] - https://gerrit.wikimedia.org/r/756525 (https://phabricator.wikimedia.org/T299579) [08:40:43] (PS5) Juan90264: Create Draft namespace for bgwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/755413 (https://phabricator.wikimedia.org/T299224) [08:40:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 25%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19001 and previous config saved to /var/cache/conftool/dbconfig/20220124-084047-root.json [08:40:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:43] !log jmm@cumin2002 
START - Cookbook sre.hosts.reboot-single for host ganeti1026.eqiad.wmnet [08:41:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:45:15] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1026.eqiad.wmnet [08:45:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:47:56] (CR) Filippo Giunchedi: [C: +2] Add prometheus[12]00[56] to prometheus_nodes [puppet] - https://gerrit.wikimedia.org/r/755708 (https://phabricator.wikimedia.org/T296199) (owner: Filippo Giunchedi) [08:50:24] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1026.eqiad.wmnet to ganeti01.svc.eqiad.wmnet [08:50:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1029 (re)pooling @ 50%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19002 and previous config saved to /var/cache/conftool/dbconfig/20220124-085057-root.json [08:51:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:51:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3318', diff saved to https://phabricator.wikimedia.org/P19003 and previous config saved to /var/cache/conftool/dbconfig/20220124-085144-marostegui.json [08:51:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:51:50] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti1026.eqiad.wmnet to ganeti01.svc.eqiad.wmnet [08:51:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:51:58] (PS3) Hashar: gerrit: port our theme to JavaScript [puppet] - https://gerrit.wikimedia.org/r/756111 (https://phabricator.wikimedia.org/T299877) [08:53:34] (CR) Filippo Giunchedi: [C: +2] hieradata: add host-specific Prometheus data [puppet] - 
https://gerrit.wikimedia.org/r/755711 (https://phabricator.wikimedia.org/T296199) (owner: Filippo Giunchedi) [08:55:37] (CR) Hashar: "Fun fact, despite the revert https://gerrit.wikimedia.org/r/c/operations/puppet/+/678700 the gerrit-theme.js is still around. We have sinc" [puppet] - https://gerrit.wikimedia.org/r/678646 (owner: Paladox) [08:55:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 40%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19004 and previous config saved to /var/cache/conftool/dbconfig/20220124-085551-root.json [08:55:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:03:09] PROBLEM - Check systemd state on kubernetes2008 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:03:43] PROBLEM - SSH on db2086.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:06:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1029 (re)pooling @ 60%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19005 and previous config saved to /var/cache/conftool/dbconfig/20220124-090601-root.json [09:06:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:08] I'll check the ferm failure, I've added new prometheus hosts and that triggered a ferm reload fleetwide [09:06:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3318 (T285149)', diff saved to https://phabricator.wikimedia.org/P19006 and previous config saved to /var/cache/conftool/dbconfig/20220124-090649-marostegui.json [09:06:51] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1111.eqiad.wmnet with reason: Maintenance [09:06:53] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 6:00:00 on db1111.eqiad.wmnet with reason: Maintenance [09:06:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1111 (T285149)', diff saved to https://phabricator.wikimedia.org/P19007 and previous config saved to /var/cache/conftool/dbconfig/20220124-090657-marostegui.json [09:07:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:07:01] T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149 [09:07:41] RECOVERY - Check systemd state on kubernetes2008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:08:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1111 (T285149)', diff saved to https://phabricator.wikimedia.org/P19008 and previous config saved to /var/cache/conftool/dbconfig/20220124-090803-marostegui.json [09:08:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 50%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19009 and previous config saved to /var/cache/conftool/dbconfig/20220124-091054-root.json [09:10:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:43] (PS4) Hashar: gerrit: Convert gerrit-theme to Polymer 3 [puppet] - https://gerrit.wikimedia.org/r/756111 (https://phabricator.wikimedia.org/T299877) [09:21:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1029 (re)pooling @ 75%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19010 and previous config saved to 
/var/cache/conftool/dbconfig/20220124-092105-root.json [09:21:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:23:01] (PS1) Muehlenhoff: Also rename otrs Cumin alias to vrts [puppet] - https://gerrit.wikimedia.org/r/756529 (https://phabricator.wikimedia.org/T293942) [09:23:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1111', diff saved to https://phabricator.wikimedia.org/P19011 and previous config saved to /var/cache/conftool/dbconfig/20220124-092307-marostegui.json [09:23:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 60%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19012 and previous config saved to /var/cache/conftool/dbconfig/20220124-092558-root.json [09:26:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:54] (PS1) Vgutierrez: site: Reimage cp2040 as cache::upload_envoy [puppet] - https://gerrit.wikimedia.org/r/756531 (https://phabricator.wikimedia.org/T271421) [09:30:27] !log depool cp2040 to be reimaged as cache::upload_envoy - T271421 [09:30:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:31] T271421: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421 [09:32:15] (CR) Vgutierrez: [C: +2] site: Reimage cp2040 as cache::upload_envoy [puppet] - https://gerrit.wikimedia.org/r/756531 (https://phabricator.wikimedia.org/T271421) (owner: Vgutierrez) [09:34:08] SRE-tools, Dumps-Generation, Infrastructure-Foundations, serviceops, IPv6: Some Service Operations clusters apparently do not support IPv6 - https://phabricator.wikimedia.org/T271142 (Volans) @ArielGlenn ideally the service owners, that surely know better what could be the effect of adding AA... 
[09:36:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1029 (re)pooling @ 100%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19013 and previous config saved to /var/cache/conftool/dbconfig/20220124-093608-root.json [09:36:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:31] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reimage for host cp2040.codfw.wmnet with OS buster [09:37:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:41] 10SRE, 10Traffic, 10Patch-For-Review: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vgutierrez@cumin1001 for host cp2040.codfw.wmnet with OS buster [09:37:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [09:38:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1111', diff saved to https://phabricator.wikimedia.org/P19014 and previous config saved to /var/cache/conftool/dbconfig/20220124-093812-marostegui.json [09:38:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:23] (03CR) 10Volans: [C: 03+1] "LGTM, nit improvement inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/756006 (owner: 10Muehlenhoff) [09:41:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 75%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19015 and previous config saved to /var/cache/conftool/dbconfig/20220124-094102-root.json [09:41:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:42:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging 
- https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [09:43:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set es1029 as es1 master T299741', diff saved to https://phabricator.wikimedia.org/P19016 and previous config saved to /var/cache/conftool/dbconfig/20220124-094300-marostegui.json [09:43:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:04] T299741: Upgrade es1 to Bullseye - https://phabricator.wikimedia.org/T299741 [09:43:59] (03PS1) 10Marostegui: es1027: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/756532 (https://phabricator.wikimedia.org/T299741) [09:45:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es1027 T299741', diff saved to https://phabricator.wikimedia.org/P19017 and previous config saved to /var/cache/conftool/dbconfig/20220124-094504-marostegui.json [09:45:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:00] (03CR) 10Marostegui: [C: 03+2] es1027: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/756532 (https://phabricator.wikimedia.org/T299741) (owner: 10Marostegui) [09:46:21] !log uploaded wmfmariadbpy 0.8.1 to apt.wm.o T299753 [09:46:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:25] T299753: Deploy wmfmariadbpy 0.8.1 - https://phabricator.wikimedia.org/T299753 [09:46:39] !log Deploying wmfmariadbpy 0.8.1 T299753 [09:46:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:22] (03PS1) 104nn1l2: commonswiki: Remove 'mojnews.com' from the wgCopyUploadsDomains allowlist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/756533 (https://phabricator.wikimedia.org/T299881) [09:49:09] 10ops-codfw: Reset db2086's idrac - https://phabricator.wikimedia.org/T299882 (10Marostegui) [09:50:54] !log marostegui@cumin1001 START - 
Cookbook sre.hosts.reimage for host es1027.eqiad.wmnet with OS bullseye [09:50:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1111 (T285149)', diff saved to https://phabricator.wikimedia.org/P19018 and previous config saved to /var/cache/conftool/dbconfig/20220124-095317-marostegui.json [09:53:19] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1114.eqiad.wmnet with reason: Maintenance [09:53:20] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1114.eqiad.wmnet with reason: Maintenance [09:53:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:21] T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149 [09:53:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1114 (T285149)', diff saved to https://phabricator.wikimedia.org/P19019 and previous config saved to /var/cache/conftool/dbconfig/20220124-095324-marostegui.json [09:53:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1114 (T285149)', diff saved to https://phabricator.wikimedia.org/P19020 and previous config saved to /var/cache/conftool/dbconfig/20220124-095430-marostegui.json [09:54:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:56] 10SRE-tools, 10Dumps-Generation, 10Infrastructure-Foundations, 10serviceops, 10IPv6: Some Service Operations clusters apparently do not support IPv6 - https://phabricator.wikimedia.org/T271142 (10ArielGlenn) When I look at the netbox entries for the dumpdata and 
snapshot hosts, they all show ipv6 address... [09:56:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 100%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19021 and previous config saved to /var/cache/conftool/dbconfig/20220124-095605-root.json [09:56:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:07:33] 10SRE-tools, 10Dumps-Generation, 10Infrastructure-Foundations, 10serviceops, 10IPv6: Some Service Operations clusters apparently do not support IPv6 - https://phabricator.wikimedia.org/T271142 (10Volans) @ArielGlenn Since we introduced Netbox as source of truth when provisioning a new host both primary I... [10:07:36] (03CR) 10Ladsgroup: [C: 04-1] add_gb_by_central_id_T299827.py: New schema change (031 comment) [software/schema-changes] - 10https://gerrit.wikimedia.org/r/756346 (https://phabricator.wikimedia.org/T299827) (owner: 10Marostegui) [10:09:28] 10SRE-tools, 10Dumps-Generation, 10Infrastructure-Foundations, 10serviceops, 10IPv6: Some Service Operations clusters apparently do not support IPv6 - https://phabricator.wikimedia.org/T271142 (10ArielGlenn) Ah rats, I was hoping against hope that the dns records at least for the snaps had been added bef... 
[10:09:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1114', diff saved to https://phabricator.wikimedia.org/P19022 and previous config saved to /var/cache/conftool/dbconfig/20220124-100935-marostegui.json [10:09:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:31] (03PS1) 10Kormat: switchdc: Remove 09-update-tendril [cookbooks] - 10https://gerrit.wikimedia.org/r/756535 (https://phabricator.wikimedia.org/T297605) [10:15:57] !log pool cp2040 using envoy as TLS termination layer - T271421 [10:16:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:16:02] T271421: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421 [10:16:37] 10SRE, 10Traffic, 10Patch-For-Review: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421 (10Vgutierrez) [10:17:25] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp2040.codfw.wmnet with OS buster [10:17:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:33] 10SRE, 10Traffic, 10Patch-For-Review: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vgutierrez@cumin1001 for host cp2040.codfw.wmnet with OS buster completed: - cp2040 (**PASS*... 
[10:18:08] (03CR) 10Volans: [C: 03+1] "LGTM, make sure to update wikitech too:" [cookbooks] - 10https://gerrit.wikimedia.org/r/756535 (https://phabricator.wikimedia.org/T297605) (owner: 10Kormat) [10:18:25] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es1027.eqiad.wmnet with OS bullseye [10:18:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:19:41] (03PS1) 10Marostegui: Revert "es1027: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/756082 [10:19:53] (03PS3) 10Jbond: P:installserver::proxy: switch access logs to syslog [puppet] - 10https://gerrit.wikimedia.org/r/754520 (https://phabricator.wikimedia.org/T298087) [10:20:08] (03CR) 10Jbond: P:installserver::proxy: switch access logs to syslog (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/754520 (https://phabricator.wikimedia.org/T298087) (owner: 10Jbond) [10:20:22] (03CR) 10Marostegui: [C: 03+2] Revert "es1027: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/756082 (owner: 10Marostegui) [10:20:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1027 (re)pooling @ 1%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19023 and previous config saved to /var/cache/conftool/dbconfig/20220124-102037-root.json [10:20:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:21:33] (03CR) 10Kormat: [C: 03+2] switchdc: Remove 09-update-tendril (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/756535 (https://phabricator.wikimedia.org/T297605) (owner: 10Kormat) [10:22:05] (03PS2) 10Marostegui: add_gb_by_central_id_T299827.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/756346 (https://phabricator.wikimedia.org/T299827) [10:22:19] (03CR) 10Marostegui: add_gb_by_central_id_T299827.py: New schema change (031 comment) [software/schema-changes] - 10https://gerrit.wikimedia.org/r/756346 
(https://phabricator.wikimedia.org/T299827) (owner: 10Marostegui) [10:22:57] (03CR) 10Ladsgroup: [C: 03+1] add_gb_by_central_id_T299827.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/756346 (https://phabricator.wikimedia.org/T299827) (owner: 10Marostegui) [10:23:20] (03CR) 10Marostegui: [V: 03+2 C: 03+2] add_gb_by_central_id_T299827.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/756346 (https://phabricator.wikimedia.org/T299827) (owner: 10Marostegui) [10:24:13] (03Merged) 10jenkins-bot: switchdc: Remove 09-update-tendril [cookbooks] - 10https://gerrit.wikimedia.org/r/756535 (https://phabricator.wikimedia.org/T297605) (owner: 10Kormat) [10:24:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1114', diff saved to https://phabricator.wikimedia.org/P19024 and previous config saved to /var/cache/conftool/dbconfig/20220124-102440-marostegui.json [10:24:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:27] (03PS1) 10Marostegui: es2026,2031,2033: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/756537 (https://phabricator.wikimedia.org/T299889) [10:32:30] (03PS1) 10Jbond: P:base::firewall: Add proemethous hosts to catch all ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/756538 (https://phabricator.wikimedia.org/T291946) [10:33:14] (03CR) 10Marostegui: [C: 03+2] es2026,2031,2033: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/756537 (https://phabricator.wikimedia.org/T299889) (owner: 10Marostegui) [10:33:52] (03CR) 10Jbond: puppetdb-api: allow prometheus_nodes via ferm (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/755982 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [10:34:37] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host es2031.codfw.wmnet with OS bullseye [10:34:39] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:35:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1027 (re)pooling @ 5%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19025 and previous config saved to /var/cache/conftool/dbconfig/20220124-103540-root.json [10:35:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:35:49] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host es2033.codfw.wmnet with OS bullseye [10:35:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1114 (T285149)', diff saved to https://phabricator.wikimedia.org/P19026 and previous config saved to /var/cache/conftool/dbconfig/20220124-103945-marostegui.json [10:39:47] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1167.eqiad.wmnet with reason: Maintenance [10:39:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:49] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1167.eqiad.wmnet with reason: Maintenance [10:39:49] T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149 [10:39:50] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [10:39:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:54] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [10:39:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:57] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1167 (T285149)', diff saved to https://phabricator.wikimedia.org/P19027 and previous config saved to /var/cache/conftool/dbconfig/20220124-103958-marostegui.json [10:40:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:02] (03CR) 10Jbond: [C: 03+1] "lgtm but see nits/comments" [software/statograph] - 10https://gerrit.wikimedia.org/r/756041 (https://phabricator.wikimedia.org/T298619) (owner: 10CDanis) [10:41:55] (03Abandoned) 10Ladsgroup: Avoid double parsing [extensions/FlaggedRevs] (wmf/1.38.0-wmf.18) - 10https://gerrit.wikimedia.org/r/755406 (https://phabricator.wikimedia.org/T292300) (owner: 10Ladsgroup) [10:45:16] (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] puppetdb-api: allow prometheus_nodes via ferm (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/755982 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [10:47:53] (03CR) 10Filippo Giunchedi: [C: 03+1] "I'm favor of this, and will help with ditching prometheus_nodes across the codebase (cfr https://phabricator.wikimedia.org/T207292)" [puppet] - 10https://gerrit.wikimedia.org/r/756538 (https://phabricator.wikimedia.org/T291946) (owner: 10Jbond) [10:49:35] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reimage for host restbase1022.eqiad.wmnet with OS buster [10:49:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:18] (03CR) 10Hashar: "I have removed my attempt and instead borrowed the code Paladox wrote back in April 2021: https://gerrit.wikimedia.org/r/c/operations/pupp" [puppet] - 10https://gerrit.wikimedia.org/r/756111 (https://phabricator.wikimedia.org/T299877) (owner: 10Hashar) [10:50:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1027 (re)pooling @ 10%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19028 and previous config saved to 
/var/cache/conftool/dbconfig/20220124-105044-root.json [10:50:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:05] (03CR) 10Jbond: [C: 03+1] "shame but lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/756522 (https://phabricator.wikimedia.org/T299409) (owner: 10Elukey) [10:51:43] !log marostegui@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host es2031.codfw.wmnet with OS bullseye [10:51:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:53:03] 10SRE, 10Infrastructure-Foundations: Updated java.security policy in OpenJDK 11.0.4 - https://phabricator.wikimedia.org/T299894 (10MoritzMuehlenhoff) [10:53:10] 10SRE, 10Infrastructure-Foundations: Updated java.security policy in OpenJDK 11.0.4 - https://phabricator.wikimedia.org/T299894 (10MoritzMuehlenhoff) p:05Triage→03Medium [10:54:01] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host es2031.codfw.wmnet with OS bullseye [10:54:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:34] (03CR) 10Jbond: [C: 03+1] Add a kafka_11 profile to the PKI Kafka Intermediate settings (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/756522 (https://phabricator.wikimedia.org/T299409) (owner: 10Elukey) [10:54:56] (03PS1) 10Vgutierrez: site: Reimage cp1088 as cache::upload_envoy [puppet] - 10https://gerrit.wikimedia.org/r/756541 (https://phabricator.wikimedia.org/T271421) [10:56:32] !log installing modsecurity-apache security updates [10:56:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:35] (03PS2) 10Jbond: P:base::firewall: Add proemethous hosts to catch all ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/756538 (https://phabricator.wikimedia.org/T291946) [10:58:42] !log depool cp1088 to be reimaged as cache::upload_envoy - T271421 [10:58:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:46] T271421: Test 
envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421 [10:58:49] (03PS1) 10Filippo Giunchedi: hieradata: expect 404 when probing puppetdb-api/ [puppet] - 10https://gerrit.wikimedia.org/r/756542 (https://phabricator.wikimedia.org/T291946) [10:59:16] !log hnowlan@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host restbase1022.eqiad.wmnet with OS buster [10:59:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:32] (03CR) 10Vgutierrez: [C: 03+2] site: Reimage cp1088 as cache::upload_envoy [puppet] - 10https://gerrit.wikimedia.org/r/756541 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [10:59:40] 10SRE, 10ops-codfw, 10ops-eqiad: Installation issues on PowerEdge R440 Restbase servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299652 (10hnowlan) [10:59:49] (03CR) 10Elukey: Add a kafka_11 profile to the PKI Kafka Intermediate settings (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/756522 (https://phabricator.wikimedia.org/T299409) (owner: 10Elukey) [11:00:21] (03CR) 10Jbond: P:base::firewall: Add proemethous hosts to catch all ferm rule (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/756538 (https://phabricator.wikimedia.org/T291946) (owner: 10Jbond) [11:00:22] (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: expect 404 when probing puppetdb-api/ [puppet] - 10https://gerrit.wikimedia.org/r/756542 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [11:00:28] (03PS2) 10Filippo Giunchedi: hieradata: expect 404 when probing puppetdb-api/ [puppet] - 10https://gerrit.wikimedia.org/r/756542 (https://phabricator.wikimedia.org/T291946) [11:02:16] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase1022.eqiad.wmnet [11:02:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:43] (03CR) 10Arturo Borrero Gonzalez: [C: 
03+1] "LGTM, thanks." [puppet] - 10https://gerrit.wikimedia.org/r/755979 (https://phabricator.wikimedia.org/T273585) (owner: 10ArielGlenn) [11:03:49] (03CR) 10Jbond: Add a kafka_11 profile to the PKI Kafka Intermediate settings (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/756522 (https://phabricator.wikimedia.org/T299409) (owner: 10Elukey) [11:03:51] RECOVERY - SSH on db2086.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:03:55] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reimage for host cp1088.eqiad.wmnet with OS buster [11:03:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:01] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] add systemd timer for Enterprise HTML dumps download and rsync (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/755979 (https://phabricator.wikimedia.org/T273585) (owner: 10ArielGlenn) [11:04:04] 10SRE, 10Traffic, 10Patch-For-Review: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vgutierrez@cumin1001 for host cp1088.eqiad.wmnet with OS buster [11:04:56] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reimage for host restbase1023.eqiad.wmnet with OS buster [11:04:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1027 (re)pooling @ 20%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19030 and previous config saved to /var/cache/conftool/dbconfig/20220124-110548-root.json [11:05:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:02] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] Remove kea, nod, and sms from wmfGetVariantSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/749889 
(https://phabricator.wikimedia.org/T299304) (owner: 10Amire80) [11:06:17] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es2033.codfw.wmnet with OS bullseye [11:06:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:40] (03PS3) 10Elukey: Add a kafka_11 profile to the PKI Kafka Intermediate settings [puppet] - 10https://gerrit.wikimedia.org/r/756522 (https://phabricator.wikimedia.org/T299409) [11:07:58] (03CR) 10Elukey: Add a kafka_11 profile to the PKI Kafka Intermediate settings (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/756522 (https://phabricator.wikimedia.org/T299409) (owner: 10Elukey) [11:08:35] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin2001 is CRITICAL: CRITICAL: the following (9) node(s) change every puppet run: restbase2020, restbase1019, miscweb1002, restbase1020, build2001, restbase2019, restbase1021, wdqs1010, restbase2011 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes [11:16:29] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 3 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33393/console" [puppet] - 10https://gerrit.wikimedia.org/r/756522 (https://phabricator.wikimedia.org/T299409) (owner: 10Elukey) [11:17:03] !log hnowlan@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host restbase1023.eqiad.wmnet with OS buster [11:17:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:16] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase1023.eqiad.wmnet [11:19:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:34] 10SRE, 10ops-codfw, 10ops-eqiad: Installation issues on PowerEdge R440 Restbase servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299652 (10hnowlan) [11:19:59] !log hnowlan@cumin1001 START - Cookbook 
sre.hosts.reimage for host restbase1024.eqiad.wmnet with OS buster [11:20:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1027 (re)pooling @ 25%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19031 and previous config saved to /var/cache/conftool/dbconfig/20220124-112051-root.json [11:20:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T285149)', diff saved to https://phabricator.wikimedia.org/P19032 and previous config saved to /var/cache/conftool/dbconfig/20220124-112113-marostegui.json [11:21:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:17] T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149 [11:26:55] (03PS4) 10Elukey: Add a kafka_11 profile to the PKI Kafka Intermediate settings [puppet] - 10https://gerrit.wikimedia.org/r/756522 (https://phabricator.wikimedia.org/T299409) [11:27:00] (03PS1) 10Ladsgroup: Use MainStash instead of db-replicated [extensions/AbuseFilter] (wmf/1.38.0-wmf.18) - 10https://gerrit.wikimedia.org/r/756083 (https://phabricator.wikimedia.org/T272512) [11:27:20] jouncebot: nowandnext [11:27:20] No deployments scheduled for the next 0 hour(s) and 32 minute(s) [11:27:20] In 0 hour(s) and 32 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220124T1200) [11:27:32] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es2031.codfw.wmnet with OS bullseye [11:27:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:25] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/756522 (https://phabricator.wikimedia.org/T299409) (owner: 10Elukey) [11:31:25] !log hnowlan@cumin1001 
START - Cookbook sre.hosts.reimage for host restbase1024.eqiad.wmnet with OS buster [11:31:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:32] !log hnowlan@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host restbase1024.eqiad.wmnet with OS buster [11:31:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:04] (03CR) 10Elukey: [C: 03+2] Add a kafka_11 profile to the PKI Kafka Intermediate settings [puppet] - 10https://gerrit.wikimedia.org/r/756522 (https://phabricator.wikimedia.org/T299409) (owner: 10Elukey) [11:33:34] * addshore tries to remember to be here for his deploy window patch [11:34:13] (03CR) 10ArielGlenn: add systemd timer for Enterprise HTML dumps download and rsync (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/755979 (https://phabricator.wikimedia.org/T273585) (owner: 10ArielGlenn) [11:35:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1027 (re)pooling @ 40%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19033 and previous config saved to /var/cache/conftool/dbconfig/20220124-113555-root.json [11:35:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P19034 and previous config saved to /var/cache/conftool/dbconfig/20220124-113618-marostegui.json [11:36:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:48] !log elukey@cumin1001 START - Cookbook sre.kafka.roll-restart-brokers for Kafka A:kafka-test-eqiad cluster: Roll restart of jvm daemons for openjdk upgrade. 
[11:37:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:51] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host es2026.codfw.wmnet with OS bullseye [11:37:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:00] RECOVERY - SSH on mw2258.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:41:37] 10SRE, 10ops-codfw, 10ops-eqiad: Installation issues on PowerEdge R440 Restbase servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299652 (10hnowlan) [11:45:51] (03PS1) 10JMeybohm: Make a bundle signer return it's root CA [software/cfssl-issuer] - 10https://gerrit.wikimedia.org/r/756546 [11:48:52] !log ayounsi@cumin1001 START - Cookbook sre.dns.netbox [11:48:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:53] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1088.eqiad.wmnet with OS buster [11:49:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:01] 10SRE, 10Traffic, 10Patch-For-Review: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vgutierrez@cumin1001 for host cp1088.eqiad.wmnet with OS buster completed: - cp1088 (**WARN*... 
[11:50:36] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reimage for host restbase1024.eqiad.wmnet with OS buster [11:50:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:46] !log pool cp1088 using envoy as TLS termination layer - T271421 [11:50:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:49] T271421: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421 [11:50:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1027 (re)pooling @ 50%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19035 and previous config saved to /var/cache/conftool/dbconfig/20220124-115059-root.json [11:51:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:51:08] (03PS1) 10Elukey: nagios: update settings for ssl_kafka [puppet] - 10https://gerrit.wikimedia.org/r/756548 (https://phabricator.wikimedia.org/T299409) [11:51:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P19036 and previous config saved to /var/cache/conftool/dbconfig/20220124-115123-marostegui.json [11:51:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:51:46] 10SRE, 10Traffic, 10Patch-For-Review: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421 (10Vgutierrez) [11:52:13] (03CR) 10JMeybohm: [C: 04-1] Make a bundle signer return it's root CA (031 comment) [software/cfssl-issuer] - 10https://gerrit.wikimedia.org/r/756546 (owner: 10JMeybohm) [11:52:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove special groups from s8 codfw T263127', diff saved to https://phabricator.wikimedia.org/P19037 and previous config saved to /var/cache/conftool/dbconfig/20220124-115236-marostegui.json [11:52:40] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:41] T263127: Remove groups from db configs - https://phabricator.wikimedia.org/T263127 [11:53:18] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:53:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove contributions from s8 eqiad T263127', diff saved to https://phabricator.wikimedia.org/P19038 and previous config saved to /var/cache/conftool/dbconfig/20220124-115334-marostegui.json [11:53:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:29] (03CR) 10Elukey: [C: 03+2] nagios: update settings for ssl_kafka [puppet] - 10https://gerrit.wikimedia.org/r/756548 (https://phabricator.wikimedia.org/T299409) (owner: 10Elukey) [11:56:28] (03PS1) 10Elukey: Revert "nagios: update settings for ssl_kafka" [puppet] - 10https://gerrit.wikimedia.org/r/756084 [11:56:43] !log ayounsi@cumin1001 START - Cookbook sre.dns.netbox [11:56:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:57:33] (03CR) 10Elukey: "I thought to extend a bit the time window but it doesn't really work with the PKI timings. 7/3 days should be ok for the moment." [puppet] - 10https://gerrit.wikimedia.org/r/756084 (owner: 10Elukey) [11:57:52] (03Abandoned) 10Elukey: Revert "nagios: update settings for ssl_kafka" [puppet] - 10https://gerrit.wikimedia.org/r/756084 (owner: 10Elukey) [11:58:26] !log elukey@cumin1001 END (PASS) - Cookbook sre.kafka.roll-restart-brokers (exit_code=0) for Kafka A:kafka-test-eqiad cluster: Roll restart of jvm daemons for openjdk upgrade. 
[11:58:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:19] !log hnowlan@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host restbase1024.eqiad.wmnet with OS buster [11:59:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: Dear deployers, time to do the UTC morning backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220124T1200). [12:00:04] aharoni, addshore, and nn1l2: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [12:00:09] hi [12:00:19] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reimage for host restbase1025.eqiad.wmnet with OS buster [12:00:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:22] hi [12:00:35] hi [12:01:05] o/ [12:01:07] hi aharoni & addshore [12:01:12] i can deploy today [12:01:17] epic! [12:01:20] (sorry for breaking the “hi” combo) [12:01:25] Shalom from the Jerusalem [12:01:29] Shalom from Jerusalem [12:01:35] (03CR) 10Urbanecm: [C: 03+2] fawiki: Remove move-rootuserpages flag from users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/756150 (https://phabricator.wikimedia.org/T299847) (owner: 104nn1l2) [12:01:49] aharoni: will you be able to test https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/749889? [12:01:57] Yes.
[12:02:22] (03Merged) 10jenkins-bot: fawiki: Remove move-rootuserpages flag from users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/756150 (https://phabricator.wikimedia.org/T299847) (owner: 104nn1l2) [12:02:38] aharoni: great, thanks [12:02:49] will ping once ready [12:03:21] nn1l2: please test your first patch at mwdebug1001 [12:03:27] ok [12:04:03] LGTM [12:04:07] syncing [12:04:54] (03PS5) 10Urbanecm: Remove kea, nod, and sms from wmfGetVariantSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/749889 (https://phabricator.wikimedia.org/T299304) (owner: 10Amire80) [12:04:58] (03CR) 10Urbanecm: [C: 03+2] Remove kea, nod, and sms from wmfGetVariantSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/749889 (https://phabricator.wikimedia.org/T299304) (owner: 10Amire80) [12:05:24] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 1a463610ba2a92f7437c6921a9591616de0d242e: fawiki: Remove move-rootuserpages flag from users (T299847) (duration: 00m 49s) [12:05:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:28] T299847: Remove move-rootuserpages flag from users on Farsi Wikipedia - https://phabricator.wikimedia.org/T299847 [12:05:30] nn1l2: first patch is live [12:05:44] (03Merged) 10jenkins-bot: Remove kea, nod, and sms from wmfGetVariantSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/749889 (https://phabricator.wikimedia.org/T299304) (owner: 10Amire80) [12:06:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1027 (re)pooling @ 60%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19039 and previous config saved to /var/cache/conftool/dbconfig/20220124-120602-root.json [12:06:04] Thanks! It looks good [12:06:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:08] aharoni: your patch is at mwdebug1001, please have a look. 
[12:06:14] (03PS2) 10Urbanecm: fawiki: Exempt draft namespace from robots control by users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/756152 (https://phabricator.wikimedia.org/T299850) (owner: 104nn1l2) [12:06:15] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:06:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T285149)', diff saved to https://phabricator.wikimedia.org/P19040 and previous config saved to /var/cache/conftool/dbconfig/20220124-120627-marostegui.json [12:06:29] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1126.eqiad.wmnet with reason: Maintenance [12:06:31] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1126.eqiad.wmnet with reason: Maintenance [12:06:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:31] T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149 [12:06:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1126 (T285149)', diff saved to https://phabricator.wikimedia.org/P19041 and previous config saved to /var/cache/conftool/dbconfig/20220124-120635-marostegui.json [12:06:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:43] addshore: would it be ok if i ping you once done with other patches and let you self-serve? 
[12:06:47] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 4 days, 0:00:00 on ganeti1014.eqiad.wmnet with reason: Remove from Ganeti cluster for reimage [12:06:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:49] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on ganeti1014.eqiad.wmnet with reason: Remove from Ganeti cluster for reimage [12:06:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:13] urbanecm: I'm not able to deploy myself right now! [12:07:18] (03PS1) 10Marostegui: Revert "es2026,2031,2033: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/756085 [12:07:43] addshore: i see. in that case, can you enlighten me about what the patch does? [12:08:01] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es2026.codfw.wmnet with OS bullseye [12:08:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:05] urbanecm: looks good [12:08:15] certainly, it just registers a schema to be accepted by the event logging system [12:08:23] (03CR) 10Marostegui: [C: 03+2] Revert "es2026,2031,2033: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/756085 (owner: 10Marostegui) [12:08:25] !log hnowlan@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host restbase1025.eqiad.wmnet with OS buster [12:08:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:39] 10SRE, 10ops-codfw, 10ops-eqiad: Installation issues on PowerEdge R440 Restbase servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299652 (10hnowlan) [12:08:48] I believe when deployed it should make it visible on https://meta.wikimedia.org/w/api.php?action=streamconfigs [12:08:48] addshore: i see. So, just a plain sync should work? [12:08:53] urbanecm: yup!
[12:09:03] and thanks for the api link, will check it too [12:09:10] thanks aharoni , syncing [12:09:16] ut yes, in general it is a noop, other than making it appear on https://meta.wikimedia.org/w/api.php?action=streamconfigs [12:09:18] *but [12:09:53] (03CR) 10Urbanecm: [C: 03+2] fawiki: Exempt draft namespace from robots control by users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/756152 (https://phabricator.wikimedia.org/T299850) (owner: 104nn1l2) [12:10:09] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [12:10:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:26] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 97d047bc4bf3748fc76f63647d77d26cc545b49f: Remove kea, nod, and sms from wmfGetVariantSettings (T299304; T296286; T298075; T298182) (duration: 00m 49s) [12:10:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es1033 T299889', diff saved to https://phabricator.wikimedia.org/P19042 and previous config saved to /var/cache/conftool/dbconfig/20220124-121029-marostegui.json [12:10:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:33] T298182: Add Kabuverdianu (kea) to Names.php - https://phabricator.wikimedia.org/T298182 [12:10:33] T296286: Add Skolt Sami (sms) to Names.php - https://phabricator.wikimedia.org/T296286 [12:10:34] T298075: Add Northern Thai (nod) to Names.php - https://phabricator.wikimedia.org/T298075 [12:10:34] aharoni: and, live [12:10:34] T299304: Remove kea, nod, and sms from wmgExtraLanguageNames on Wikimedia configuration - https://phabricator.wikimedia.org/T299304 [12:10:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:37] T299889: Upgrade es2 to Bullseye - https://phabricator.wikimedia.org/T299889 [12:10:37] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase102[45].eqiad.wmnet [12:10:39] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:39] (03Merged) 10jenkins-bot: fawiki: Exempt draft namespace from robots control by users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/756152 (https://phabricator.wikimedia.org/T299850) (owner: 104nn1l2) [12:11:05] nn1l2: your second patch is at mwdebug1001, can you test? [12:11:12] (03PS2) 10Urbanecm: commonswiki: Change data.nhm.ac.uk to *.nhm.ac.uk in the wgCopyUploadsDomains allowlist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/756525 (https://phabricator.wikimedia.org/T299579) (owner: 104nn1l2) [12:11:16] (03PS1) 10Marostegui: es1033: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/756550 (https://phabricator.wikimedia.org/T299889) [12:11:17] urbanecm: thanks! [12:11:18] (03PS2) 10Urbanecm: commonswiki: Remove 'mojnews.com' from the wgCopyUploadsDomains allowlist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/756533 (https://phabricator.wikimedia.org/T299881) (owner: 104nn1l2) [12:11:21] ok [12:11:22] (03CR) 10Urbanecm: [C: 03+2] commonswiki: Change data.nhm.ac.uk to *.nhm.ac.uk in the wgCopyUploadsDomains allowlist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/756525 (https://phabricator.wikimedia.org/T299579) (owner: 104nn1l2) [12:11:25] (03CR) 10Urbanecm: [C: 03+2] commonswiki: Remove 'mojnews.com' from the wgCopyUploadsDomains allowlist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/756533 (https://phabricator.wikimedia.org/T299881) (owner: 104nn1l2) [12:11:34] aharoni: np :) [12:11:36] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [12:11:37] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [12:11:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:56] (03CR) 10Marostegui: [C: 03+2] es1033: Disable 
notifications [puppet] - 10https://gerrit.wikimedia.org/r/756550 (https://phabricator.wikimedia.org/T299889) (owner: 10Marostegui) [12:12:11] (03Merged) 10jenkins-bot: commonswiki: Change data.nhm.ac.uk to *.nhm.ac.uk in the wgCopyUploadsDomains allowlist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/756525 (https://phabricator.wikimedia.org/T299579) (owner: 104nn1l2) [12:12:15] (03Merged) 10jenkins-bot: commonswiki: Remove 'mojnews.com' from the wgCopyUploadsDomains allowlist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/756533 (https://phabricator.wikimedia.org/T299881) (owner: 104nn1l2) [12:12:32] LGTM: : view-source:https://fa.wikipedia.org/wiki/%D9%BE%DB%8C%D8%B4%E2%80%8C%D9%86%D9%88%DB%8C%D8%B3:%D8%AA%D8%B3%D8%AA [12:13:06] excellent [12:13:08] syncing [12:13:08] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [12:13:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:24] 10SRE, 10Data-Engineering: Allow kafka brokers to reload the TLS keystore - https://phabricator.wikimedia.org/T299409 (10elukey) 05Open→03Resolved a:03elukey It seems that our kafka version, 1.1, doesn't support well this use case. The kafka intermediate PKI CA now issues cert with 1y of validity, to red... 
[12:13:31] 10SRE, 10Analytics-Radar, 10Data-Engineering, 10Event-Platform, 10Patch-For-Review: Allow kafka clients to verify brokers hostnames when using SSL - https://phabricator.wikimedia.org/T291905 (10elukey) [12:13:50] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host es1033.eqiad.wmnet with OS bullseye [12:13:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:29] nn1l2: second patch live [12:14:35] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reimage for host restbase1026.eqiad.wmnet with OS buster [12:14:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:36] 10SRE, 10ops-eqiad: Installation issues on PowerEdge R440 eqiad Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299527 (10MoritzMuehlenhoff) [12:15:40] 10SRE, 10ops-eqiad: Installation issues on PowerEdge R440 eqiad Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299527 (10MoritzMuehlenhoff) One more server is ready and downtimed; ganeti1014 [12:16:06] except the sync cmd hangs for a while [12:16:47] Thanks! It looks good too [12:18:11] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [12:18:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:13] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 2c7b45a3f080757338a877f4024e27dea8cc47c5: fawiki: Exempt draft namespace from robots control by users (T299850) (duration: 05m 39s) [12:19:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:17] T299850: Exempt draft namespace on Farsi Wikipedia from robots control by users - https://phabricator.wikimedia.org/T299850 [12:19:24] finally [12:19:47] nn1l2: both allowlist patches are at mwdebug1001 now [12:19:50] can you test? 
[12:20:01] ok [12:20:03] (03PS2) 10Urbanecm: Add mwcli.command_execute to wgEventStreams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/755794 (https://phabricator.wikimedia.org/T293583) (owner: 10Addshore) [12:20:06] (03CR) 10Urbanecm: [C: 03+2] Add mwcli.command_execute to wgEventStreams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/755794 (https://phabricator.wikimedia.org/T293583) (owner: 10Addshore) [12:20:57] (03Merged) 10jenkins-bot: Add mwcli.command_execute to wgEventStreams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/755794 (https://phabricator.wikimedia.org/T293583) (owner: 10Addshore) [12:21:01] LGTM, upload successful: https://commons.wikimedia.org/wiki/File:Richardia_telescopica_Gerstaecker,_1860.jpg [12:21:05] !log installing ICU security updates on stretch [12:21:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1027 (re)pooling @ 75%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19043 and previous config saved to /var/cache/conftool/dbconfig/20220124-122106-root.json [12:21:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:21:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:21:27] thanks, syncing [12:21:35] let's hope it will be faster this time [12:22:18] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: db340cc: 5424d69: Update wgCopyUploadsDomains allowlist (T299579, T299881) (duration: 00m 48s) [12:22:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:23] good [12:22:24] T299579: Add *.nhm.ac.uk to the wgCopyUploadsDomains allowlist of Wikimedia Commons - https://phabricator.wikimedia.org/T299579 [12:22:24] T299881: Remove redundant mojnews.com from the wgCopyUploadsDomains allowlist of Wikimedia Commons - https://phabricator.wikimedia.org/T299881 [12:22:40] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn 
[12:22:41] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [12:22:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:47] addshore: pulled your patch to mwdebug1001 in case you want to have a look [12:22:50] !log hnowlan@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host restbase1026.eqiad.wmnet with OS buster [12:22:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:05] i see mwcli.command_execute in https://meta.wikimedia.org/w/api.php?action=streamconfigs, so i think it works [12:23:08] urbanecm: looks good to me [12:23:11] syncing [12:23:15] ty [12:24:21] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 296fe1644a2a71914e880f3562f8e32fd66c1637: Add mwcli.command_execute to wgEventStreams (T293583) (duration: 00m 48s) [12:24:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:25] T293583: [mwcli] reporting on usage - https://phabricator.wikimedia.org/T293583 [12:24:26] and, we're live [12:24:48] and done [12:24:51] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase1026.eqiad.wmnet [12:24:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:09] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reimage for host restbase1027.eqiad.wmnet with OS buster [12:25:10] !log UTC morning B&C done [12:25:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:34] 10SRE, 10ops-codfw, 10ops-eqiad: Installation issues on PowerEdge R440 Restbase servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299652 (10hnowlan) [12:27:01] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE 
helmfile.d/services/mwdebug: sync on pinkunicorn [12:27:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:03] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [12:32:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:40] 10SRE, 10ops-codfw, 10ops-eqiad: Installation issues on PowerEdge R440 Restbase servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299652 (10hnowlan) [12:36:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1027 (re)pooling @ 100%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19044 and previous config saved to /var/cache/conftool/dbconfig/20220124-123609-root.json [12:36:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:51] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [12:36:52] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [12:36:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:00] /7 [12:37:02] nope [12:37:02] :D [12:38:58] !log ayounsi@cumin1001 START - Cookbook sre.network.prepare-upgrade [12:38:58] !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.network.prepare-upgrade (exit_code=99) [12:39:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:09] !log ayounsi@cumin1001 START - Cookbook sre.network.prepare-upgrade [12:39:09] !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.network.prepare-upgrade (exit_code=99) [12:39:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:12] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:20] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase1027.eqiad.wmnet [12:39:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:31] !log hnowlan@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host restbase1027.eqiad.wmnet with OS buster [12:39:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:06] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reimage for host restbase1028.eqiad.wmnet with OS buster [12:40:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:50] (03PS1) 10Marostegui: Revert "es1033: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/756568 [12:40:58] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es1033.eqiad.wmnet with OS bullseye [12:41:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:00] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [12:41:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:40] (03CR) 10Marostegui: [C: 03+2] Revert "es1033: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/756568 (owner: 10Marostegui) [12:41:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1033 (re)pooling @ 1%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19045 and previous config saved to /var/cache/conftool/dbconfig/20220124-124140-root.json [12:41:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:38] (03CR) 10Jbond: "see comments i think i may be missing something 😕" [software/cfssl-issuer] - 10https://gerrit.wikimedia.org/r/756546 (owner: 10JMeybohm) [12:55:03] (03PS1) 10Ladsgroup: es2034: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/756565 
(https://phabricator.wikimedia.org/T299911) [12:56:00] (03CR) 10Marostegui: [C: 03+1] es2034: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/756565 (https://phabricator.wikimedia.org/T299911) (owner: 10Ladsgroup) [12:56:28] (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33394/console" [puppet] - 10https://gerrit.wikimedia.org/r/756542 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [12:56:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1033 (re)pooling @ 5%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19046 and previous config saved to /var/cache/conftool/dbconfig/20220124-125643-root.json [12:56:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:00] (03PS2) 10Muehlenhoff: sre.ganeti.addnode: Make target for validate_state configurable [cookbooks] - 10https://gerrit.wikimedia.org/r/756006 [12:58:27] (03CR) 10Muehlenhoff: sre.ganeti.addnode: Make target for validate_state configurable (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/756006 (owner: 10Muehlenhoff) [13:00:32] (03PS1) 10Ladsgroup: es2029: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/756586 (https://phabricator.wikimedia.org/T299911) [13:00:43] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/756006 (owner: 10Muehlenhoff) [13:00:56] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] es2034: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/756565 (https://phabricator.wikimedia.org/T299911) (owner: 10Ladsgroup) [13:03:19] (03PS3) 10Filippo Giunchedi: thanos: move to a single flag to control uploads [puppet] - 10https://gerrit.wikimedia.org/r/755712 (https://phabricator.wikimedia.org/T296199) [13:05:19] (03CR) 10Filippo Giunchedi: [C: 03+2] thanos: move to a single flag to control uploads [puppet] - 
10https://gerrit.wikimedia.org/r/755712 (https://phabricator.wikimedia.org/T296199) (owner: 10Filippo Giunchedi) [13:06:00] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es2034.codfw.wmnet with reason: reimage for upgrade - T299911 [13:06:02] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2034.codfw.wmnet with reason: reimage for upgrade - T299911 [13:06:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:04] T299911: Upgrade es3 to Bullseye - https://phabricator.wikimedia.org/T299911 [13:06:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1126 (T285149)', diff saved to https://phabricator.wikimedia.org/P19047 and previous config saved to /var/cache/conftool/dbconfig/20220124-130654-marostegui.json [13:06:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:58] T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149 [13:07:26] (03CR) 10Jbond: [C: 03+2] gerrit: Convert gerrit-theme to Polymer 3 [puppet] - 10https://gerrit.wikimedia.org/r/756111 (https://phabricator.wikimedia.org/T299877) (owner: 10Hashar) [13:08:32] (03PS3) 10Muehlenhoff: sre.ganeti.addnode: Make target for validate_state configurable [cookbooks] - 10https://gerrit.wikimedia.org/r/756006 [13:10:23] (03CR) 10Hashar: "Notice: /Stage[main]/Gerrit::Jetty/File[/var/lib/gerrit2/review_site/static/gerrit-theme.html]/ensure: removed" [puppet] - 10https://gerrit.wikimedia.org/r/756111 (https://phabricator.wikimedia.org/T299877) (owner: 10Hashar) [13:11:35] (03PS3) 10Arturo Borrero Gonzalez: wmcs: vps: create_instance_with_prefix: support creating more than 1 instance [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/755707 [13:11:37] (03PS1) 10Arturo Borrero Gonzalez: wmcs: vps: 
create_instance_with_prefix: drop unused project parameter [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/756587 [13:11:39] (03PS1) 10Arturo Borrero Gonzalez: wmcs: openstack: fix security group functions [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/756588 [13:11:41] (03PS1) 10Arturo Borrero Gonzalez: wmcs: vps: start_instance_with_prefix: refactor and fix default behavior [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/756589 [13:11:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1033 (re)pooling @ 10%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19048 and previous config saved to /var/cache/conftool/dbconfig/20220124-131147-root.json [13:11:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:14] (03CR) 10Muehlenhoff: [C: 03+2] sre.ganeti.addnode: Make target for validate_state configurable [cookbooks] - 10https://gerrit.wikimedia.org/r/756006 (owner: 10Muehlenhoff) [13:12:16] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] add systemd timer for Enterprise HTML dumps download and rsync (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/755979 (https://phabricator.wikimedia.org/T273585) (owner: 10ArielGlenn) [13:12:54] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin2002 is CRITICAL: CRITICAL: the following (11) node(s) change every puppet run: miscweb1002, restbase1021, restbase2020, restbase2019, wdqs1010, labstore1007, restbase1020, restbase1019, labstore1006, restbase2011, build2001 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes [13:13:18] (03CR) 10Jbond: [C: 03+2] P:base::firewall: Add proemethous hosts to catch all ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/756538 (https://phabricator.wikimedia.org/T291946) (owner: 10Jbond) [13:13:34] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.reimage for host es2034.codfw.wmnet with OS bullseye [13:13:36] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:44] (03PS1) 10Jbond: Revert "P:base::firewall: Add proemethous hosts to catch all ferm rule" [puppet] - 10https://gerrit.wikimedia.org/r/756569 [13:14:51] fyi i broke puppet sending a fix now [13:14:57] (03CR) 10Jbond: [V: 03+2 C: 03+2] Revert "P:base::firewall: Add proemethous hosts to catch all ferm rule" [puppet] - 10https://gerrit.wikimedia.org/r/756569 (owner: 10Jbond) [13:18:42] PROBLEM - Widespread puppet agent failures- no resources reported on alert1001 is CRITICAL: 0.05429 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [13:19:47] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase1028.eqiad.wmnet with OS buster [13:19:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:02] ^^^ re puppet failure, fix is deployed running puppet on failed nodes now [13:21:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1126', diff saved to https://phabricator.wikimedia.org/P19049 and previous config saved to /var/cache/conftool/dbconfig/20220124-132159-marostegui.json [13:22:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:29] (03CR) 10Jbond: [C: 03+2] P:installserver::proxy: switch access logs to syslog [puppet] - 10https://gerrit.wikimedia.org/r/754520 (https://phabricator.wikimedia.org/T298087) (owner: 10Jbond) [13:24:09] PROBLEM - Cassandra instance data free space on restbase2010 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7002 MB (19% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [13:26:11] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reimage for host restbase1029.eqiad.wmnet with OS buster [13:26:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:51] !log marostegui@cumin1001 dbctl commit 
(dc=all): 'es1033 (re)pooling @ 20%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19050 and previous config saved to /var/cache/conftool/dbconfig/20220124-132651-root.json [13:26:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:51] (03CR) 10ArielGlenn: [C: 03+2] "Verified that the new creds work for downloading." [puppet] - 10https://gerrit.wikimedia.org/r/755979 (https://phabricator.wikimedia.org/T273585) (owner: 10ArielGlenn) [13:27:54] (03PS1) 10Filippo Giunchedi: ssl: add search.d.w public key [puppet] - 10https://gerrit.wikimedia.org/r/756593 (https://phabricator.wikimedia.org/T299633) [13:28:21] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase1028.eqiad.wmnet [13:28:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:55] RECOVERY - Widespread puppet agent failures- no resources reported on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.002172 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [13:29:52] (03CR) 10Filippo Giunchedi: "This is a new keypair (i.e. not referenced in configurations). 
See also task for more context/info" [puppet] - 10https://gerrit.wikimedia.org/r/756593 (https://phabricator.wikimedia.org/T299633) (owner: 10Filippo Giunchedi) [13:30:59] (03PS1) 10Jbond: profile::installserver::proxy: correctly escape syslog line [puppet] - 10https://gerrit.wikimedia.org/r/756594 [13:31:41] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33395/console" [puppet] - 10https://gerrit.wikimedia.org/r/756594 (owner: 10Jbond) [13:33:03] PROBLEM - Cassandra instance data free space on restbase2010 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7386 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [13:33:22] (03PS2) 10Jbond: profile::installserver::proxy: correctly escape syslog line [puppet] - 10https://gerrit.wikimedia.org/r/756594 [13:34:03] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33396/console" [puppet] - 10https://gerrit.wikimedia.org/r/756594 (owner: 10Jbond) [13:34:28] (03CR) 10Jbond: [V: 03+1 C: 03+2] profile::installserver::proxy: correctly escape syslog line [puppet] - 10https://gerrit.wikimedia.org/r/756594 (owner: 10Jbond) [13:35:42] (03CR) 10Jbond: [V: 03+2 C: 03+2] profile::installserver::proxy: correctly escape syslog line [puppet] - 10https://gerrit.wikimedia.org/r/756594 (owner: 10Jbond) [13:37:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1126', diff saved to https://phabricator.wikimedia.org/P19051 and previous config saved to /var/cache/conftool/dbconfig/20220124-133704-marostegui.json [13:37:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:16] (03PS1) 10Filippo Giunchedi: cirrus: move to search.d.w cert [puppet] - 10https://gerrit.wikimedia.org/r/756595 (https://phabricator.wikimedia.org/T299633) [13:37:43] PROBLEM - Cassandra instance data free 
space on restbase2012 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 6898 MB (19% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [13:39:18] (CR) Emil Chetty: profile::cache::kafka::webrequest: Log Sec-CH-UA* headers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/755435 (https://phabricator.wikimedia.org/T299401) (owner: Phuedx) [13:41:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1033 (re)pooling @ 25%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19052 and previous config saved to /var/cache/conftool/dbconfig/20220124-134154-root.json [13:41:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:31] (PS2) Filippo Giunchedi: cirrus: move to search.d.w cert [puppet] - https://gerrit.wikimedia.org/r/756595 (https://phabricator.wikimedia.org/T299633) [13:42:52] !log ladsgroup@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host es2034.codfw.wmnet with OS bullseye [13:42:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:29] PROBLEM - Cassandra instance data free space on restbase2010 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7046 MB (19% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [13:44:18] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): cloudmetrics1004 potential hardware problem - https://phabricator.wikimedia.org/T299744 (MoritzMuehlenhoff) >>! In T299744#7641382, @wiki_willy wrote: > Assigning this to @Cmjohnson. However, I also reached out to @MoritzMuehlenhoff to take... 
[13:45:30] (CR) Filippo Giunchedi: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33399/console" [puppet] - https://gerrit.wikimedia.org/r/756595 (https://phabricator.wikimedia.org/T299633) (owner: Filippo Giunchedi) [13:49:13] PROBLEM - Cassandra instance data free space on restbase2010 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7268 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [13:49:55] (PS1) ArielGlenn: clean up older enterprise html dumps, keep the last 6 runs [puppet] - https://gerrit.wikimedia.org/r/756596 (https://phabricator.wikimedia.org/T273585) [13:50:01] !log installing util-linux security updates on bullseye [13:50:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1126 (T285149)', diff saved to https://phabricator.wikimedia.org/P19053 and previous config saved to /var/cache/conftool/dbconfig/20220124-135208-marostegui.json [13:52:10] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1172.eqiad.wmnet with reason: Maintenance [13:52:12] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1172.eqiad.wmnet with reason: Maintenance [13:52:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:13] T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149 [13:52:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1172 (T285149)', diff saved to https://phabricator.wikimedia.org/P19054 and previous config saved to /var/cache/conftool/dbconfig/20220124-135216-marostegui.json [13:52:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[13:52:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:49] (PS3) Filippo Giunchedi: cirrus: move to search.d.w cert [puppet] - https://gerrit.wikimedia.org/r/756595 (https://phabricator.wikimedia.org/T299633) [13:53:13] PROBLEM - Cassandra instance data free space on restbase2010 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7209 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [13:53:58] (PS1) Jbond: P:installserver::proxy: quote strings [puppet] - https://gerrit.wikimedia.org/r/756597 [13:54:20] (CR) Jbond: [V: +2 C: +2] P:installserver::proxy: quote strings [puppet] - https://gerrit.wikimedia.org/r/756597 (owner: Jbond) [13:54:45] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 75 probes of 648 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [13:54:49] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [13:55:53] SRE, Data-Engineering, Metrics-Platform, Traffic, Patch-For-Review: VarnishKafka to propagate user agent client hints headers to webrequest - https://phabricator.wikimedia.org/T299401 (EChetty) [13:56:23] PROBLEM - Cassandra instance data free space on restbase2010 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7120 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [13:56:56] (PS5) Jbond: P:rsyslog: add squid to the list of programs sent to logstash [puppet] - https://gerrit.wikimedia.org/r/754521 (https://phabricator.wikimedia.org/T298087) [13:56:58] 
!log marostegui@cumin1001 dbctl commit (dc=all): 'es1033 (re)pooling @ 40%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19055 and previous config saved to /var/cache/conftool/dbconfig/20220124-135658-root.json [13:57:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:05] (CR) Jbond: [C: +2] P:rsyslog: add squid to the list of programs sent to logstash [puppet] - https://gerrit.wikimedia.org/r/754521 (https://phabricator.wikimedia.org/T298087) (owner: Jbond) [13:58:11] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [14:00:25] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 58 probes of 648 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [14:00:29] PROBLEM - Cassandra instance data free space on restbase2012 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 6277 MB (17% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [14:00:35] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase1029.eqiad.wmnet with OS buster [14:00:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:40] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.reimage for host es2034.codfw.wmnet with OS bullseye [14:00:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:59] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase1029.eqiad.wmnet [14:01:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:11] !log 
hnowlan@cumin1001 START - Cookbook sre.hosts.reimage for host restbase1030.eqiad.wmnet with OS buster [14:01:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:09] (PS4) Filippo Giunchedi: cirrus: move to search.d.w cert [puppet] - https://gerrit.wikimedia.org/r/756595 (https://phabricator.wikimedia.org/T299633) [14:09:13] (CR) Filippo Giunchedi: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33401/console" [puppet] - https://gerrit.wikimedia.org/r/756595 (https://phabricator.wikimedia.org/T299633) (owner: Filippo Giunchedi) [14:11:24] (CR) Filippo Giunchedi: [V: +1] "See latest PCC for the diff https://puppet-compiler.wmflabs.org/pcc-worker1001/33401/" [puppet] - https://gerrit.wikimedia.org/r/756595 (https://phabricator.wikimedia.org/T299633) (owner: Filippo Giunchedi) [14:12:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1033 (re)pooling @ 50%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19056 and previous config saved to /var/cache/conftool/dbconfig/20220124-141201-root.json [14:12:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:07] PROBLEM - Cassandra instance data free space on restbase2018 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 8076 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [14:16:27] PROBLEM - Cassandra instance data free space on restbase2010 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7412 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [14:27:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1033 (re)pooling @ 60%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19057 and previous config saved to /var/cache/conftool/dbconfig/20220124-142705-root.json [14:27:08] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:36] (PS1) Filippo Giunchedi: prometheus: filesystem provisioning [puppet] - https://gerrit.wikimedia.org/r/756602 (https://phabricator.wikimedia.org/T296199) [14:27:38] (PS1) Filippo Giunchedi: site: add Prometheus role to codfw hardware [puppet] - https://gerrit.wikimedia.org/r/756603 (https://phabricator.wikimedia.org/T296199) [14:27:40] (PS1) Filippo Giunchedi: site: add Prometheus role to eqiad hardware [puppet] - https://gerrit.wikimedia.org/r/756604 (https://phabricator.wikimedia.org/T296199) [14:30:38] (PS1) Muehlenhoff: Add library hint for libs shipped by util-linux [puppet] - https://gerrit.wikimedia.org/r/756605 [14:32:12] (CR) Filippo Giunchedi: [V: +1] "PCC SUCCESS (DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33402/console" [puppet] - https://gerrit.wikimedia.org/r/756603 (https://phabricator.wikimedia.org/T296199) (owner: Filippo Giunchedi) [14:33:18] (PS6) Aqu: Airflow: Fix links in error emails [puppet] - https://gerrit.wikimedia.org/r/756017 (https://phabricator.wikimedia.org/T299398) [14:34:19] (CR) Filippo Giunchedi: site: add Prometheus role to codfw hardware [puppet] - https://gerrit.wikimedia.org/r/756603 (https://phabricator.wikimedia.org/T296199) (owner: Filippo Giunchedi) [14:34:30] (CR) Filippo Giunchedi: [C: +2] prometheus: filesystem provisioning [puppet] - https://gerrit.wikimedia.org/r/756602 (https://phabricator.wikimedia.org/T296199) (owner: Filippo Giunchedi) [14:34:32] (CR) Muehlenhoff: [C: +2] Add library hint for libs shipped by util-linux [puppet] - https://gerrit.wikimedia.org/r/756605 (owner: Muehlenhoff) [14:34:49] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es2034.codfw.wmnet with OS bullseye [14:34:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:53] moritzm: merged your 
change too [14:35:05] good timing there [14:35:14] excellent, thanks :-) [14:35:55] sure np! [14:41:26] PROBLEM - Cassandra instance data free space on restbase2012 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 5986 MB (16% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [14:41:49] (PS1) Filippo Giunchedi: prometheus: disable rsync where not needed [puppet] - https://gerrit.wikimedia.org/r/756607 (https://phabricator.wikimedia.org/T296199) [14:42:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1033 (re)pooling @ 75%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19058 and previous config saved to /var/cache/conftool/dbconfig/20220124-144208-root.json [14:42:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T285149)', diff saved to https://phabricator.wikimedia.org/P19059 and previous config saved to /var/cache/conftool/dbconfig/20220124-144234-marostegui.json [14:42:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:38] T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149 [14:44:22] (PS2) EJoseph: Upgrade to elasticsearch 6.8.23 [software/elasticsearch/plugins] - https://gerrit.wikimedia.org/r/755750 (https://phabricator.wikimedia.org/T294499) [14:44:46] !log elukey@cumin1001 START - Cookbook sre.ores.roll-restart-workers for ORES codfw cluster: Roll restart of ORES's daemons. [14:44:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:34] SRE, Kubernetes, discovery-system: Document what #discovery-system is - https://phabricator.wikimedia.org/T282948 (Joe) @Aklapper yes it used to be used for the system we built that is the base for both the DNS discovery system and dynamic configuration for things like pybal or mediawiki. 
I think th... [14:46:45] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase1030.eqiad.wmnet with OS buster [14:46:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:32] PROBLEM - Cassandra instance data free space on restbase2010 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7114 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [14:48:13] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase1030.eqiad.wmnet [14:48:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:49] (CR) Elukey: [C: +2] Remove duplicate hiera config for Hadoop test [puppet] - https://gerrit.wikimedia.org/r/755702 (owner: Elukey) [14:50:58] (PS2) Elukey: Remove duplicate hiera config for Hadoop test [puppet] - https://gerrit.wikimedia.org/r/755702 [14:51:00] SRE-Access-Requests, Research: Access to analytics-privatedata-users for Research intern AniketArs - https://phabricator.wikimedia.org/T299919 (Miriam) [14:52:35] (CR) EJoseph: Upgrade to elasticsearch 6.8.23 (2 comments) [software/elasticsearch/plugins] - https://gerrit.wikimedia.org/r/755750 (https://phabricator.wikimedia.org/T294499) (owner: EJoseph) [14:53:03] (PS1) Ladsgroup: Revert "es2034: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/756571 [14:54:03] (PS1) Jbond: O:cacheing-proxy: add support for structured logs [puppet] - https://gerrit.wikimedia.org/r/756608 [14:54:55] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33403/console" [puppet] - https://gerrit.wikimedia.org/r/756608 (owner: Jbond) [14:56:21] (CR) Elukey: [C: +1] "LGTM, will merge when the cert-manager issue is fixed!" 
[deployment-charts] - https://gerrit.wikimedia.org/r/756064 (https://phabricator.wikimedia.org/T298989) (owner: Accraze) [14:56:51] (CR) Hnowlan: api-gateway: allow TLS conns to PKI based TLS backends (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/756524 (https://phabricator.wikimedia.org/T299550) (owner: Elukey) [14:57:05] (PS2) Ladsgroup: Revert "es2034: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/756571 [14:57:09] (CR) Ladsgroup: [V: +2 C: +2] Revert "es2034: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/756571 (owner: Ladsgroup) [14:57:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1033 (re)pooling @ 100%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19060 and previous config saved to /var/cache/conftool/dbconfig/20220124-145712-root.json [14:57:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P19061 and previous config saved to /var/cache/conftool/dbconfig/20220124-145738-marostegui.json [14:57:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:34] (PS1) Giuseppe Lavagetto: CI: add complete mock list of service proxy listeners [deployment-charts] - https://gerrit.wikimedia.org/r/756609 (https://phabricator.wikimedia.org/T291959) [15:00:54] (CR) Elukey: api-gateway: allow TLS conns to PKI based TLS backends (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/756524 (https://phabricator.wikimedia.org/T299550) (owner: Elukey) [15:01:36] (PS2) Giuseppe Lavagetto: CI: add complete mock list of service proxy listeners [deployment-charts] - https://gerrit.wikimedia.org/r/756609 (https://phabricator.wikimedia.org/T291959) [15:04:14] !log elukey@cumin1001 END (PASS) - Cookbook 
sre.ores.roll-restart-workers (exit_code=0) for ORES codfw cluster: Roll restart of ORES's daemons. [15:04:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:06] !log elukey@cumin1001 START - Cookbook sre.ores.roll-restart-workers for ORES eqiad cluster: Roll restart of ORES's daemons. [15:05:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:21] (CR) Jbond: [C: +1] "LGTM" [software/pywmflib] - https://gerrit.wikimedia.org/r/754888 (owner: Volans) [15:07:41] (CR) Dzahn: "I had already uploaded https://gerrit.wikimedia.org/r/c/operations/puppet/+/755473/ previously but wanted to let Arnold merge it" [puppet] - https://gerrit.wikimedia.org/r/756529 (https://phabricator.wikimedia.org/T293942) (owner: Muehlenhoff) [15:08:11] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): (Need By: TBD) rack/setup/install clouddumps100[89] - https://phabricator.wikimedia.org/T299610 (Andrew) [15:08:35] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): (Need By: TBD) rack/setup/install clouddumps100[89] - https://phabricator.wikimedia.org/T299610 (Andrew) a:Andrew→Jclark-ctr [15:12:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P19062 and previous config saved to /var/cache/conftool/dbconfig/20220124-151243-marostegui.json [15:12:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:00] PROBLEM - Cassandra instance data free space on restbase2010 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7157 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [15:13:04] jouncebot: nowandnext [15:13:04] No deployments scheduled for the next 1 hour(s) and 16 minute(s) [15:13:05] In 1 hour(s) and 16 minute(s): Wikimedia Portals Update 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220124T1630) [15:13:11] oof, nice [15:13:20] (CR) Ladsgroup: [C: +2] Use MainStash instead of db-replicated [extensions/AbuseFilter] (wmf/1.38.0-wmf.18) - https://gerrit.wikimedia.org/r/756083 (https://phabricator.wikimedia.org/T272512) (owner: Ladsgroup) [15:13:32] (CR) Giuseppe Lavagetto: [C: +2] CI: add complete mock list of service proxy listeners [deployment-charts] - https://gerrit.wikimedia.org/r/756609 (https://phabricator.wikimedia.org/T291959) (owner: Giuseppe Lavagetto) [15:14:39] (PS1) Jbond: O:mail::mx: block abuse networks on mx hosts [puppet] - https://gerrit.wikimedia.org/r/756611 [15:15:13] (PS3) Ladsgroup: Update wikitech etcd readonly exemption [mediawiki-config] - https://gerrit.wikimedia.org/r/752134 (owner: Majavah) [15:15:15] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33404/console" [puppet] - https://gerrit.wikimedia.org/r/756611 (owner: Jbond) [15:15:17] (CR) Ladsgroup: [C: +2] Update wikitech etcd readonly exemption [mediawiki-config] - https://gerrit.wikimedia.org/r/752134 (owner: Majavah) [15:16:06] (Merged) jenkins-bot: Update wikitech etcd readonly exemption [mediawiki-config] - https://gerrit.wikimedia.org/r/752134 (owner: Majavah) [15:17:07] (Merged) jenkins-bot: CI: add complete mock list of service proxy listeners [deployment-charts] - https://gerrit.wikimedia.org/r/756609 (https://phabricator.wikimedia.org/T291959) (owner: Giuseppe Lavagetto) [15:17:36] !log ladsgroup@deploy1002 Synchronized wmf-config/CommonSettings.php: Config: [[gerrit:752134|Update wikitech etcd readonly exemption]] (duration: 00m 49s) [15:17:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:45] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [15:22:47] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:10] (PS1) MMandere: site: Add drmrs ncredir host [puppet] - https://gerrit.wikimedia.org/r/756613 (https://phabricator.wikimedia.org/T282787) [15:25:28] !log elukey@cumin1001 END (PASS) - Cookbook sre.ores.roll-restart-workers (exit_code=0) for ORES eqiad cluster: Roll restart of ORES's daemons. [15:25:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:09] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [15:27:11] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [15:27:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:41] PROBLEM - Cassandra instance data free space on restbase2018 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 8333 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [15:27:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T285149)', diff saved to https://phabricator.wikimedia.org/P19063 and previous config saved to /var/cache/conftool/dbconfig/20220124-152748-marostegui.json [15:27:50] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [15:27:52] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [15:27:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:53] T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149 [15:27:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:56] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:58] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2079.codfw.wmnet with reason: Maintenance [15:28:00] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2079.codfw.wmnet with reason: Maintenance [15:28:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:01] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 12 hosts with reason: Maintenance [15:28:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:11] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 12 hosts with reason: Maintenance [15:28:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:14] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1177.eqiad.wmnet with reason: Maintenance [15:28:15] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1177.eqiad.wmnet with reason: Maintenance [15:28:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1177 (T285149)', diff saved to https://phabricator.wikimedia.org/P19064 and previous config saved to /var/cache/conftool/dbconfig/20220124-152820-marostegui.json [15:28:22] (Merged) jenkins-bot: Use MainStash instead of db-replicated [extensions/AbuseFilter] (wmf/1.38.0-wmf.18) - https://gerrit.wikimedia.org/r/756083 (https://phabricator.wikimedia.org/T272512) (owner: Ladsgroup) [15:28:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:16] (CR) Hnowlan: [C: +1] 
api-gateway: allow TLS conns to PKI based TLS backends (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/756524 (https://phabricator.wikimedia.org/T299550) (owner: Elukey) [15:29:52] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [15:29:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:09] PROBLEM - Cassandra instance data free space on restbase2018 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 8139 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [15:30:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T285149)', diff saved to https://phabricator.wikimedia.org/P19065 and previous config saved to /var/cache/conftool/dbconfig/20220124-153026-marostegui.json [15:30:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:45] PROBLEM - Cassandra instance data free space on restbase2012 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7016 MB (19% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [15:34:26] (PS3) Giuseppe Lavagetto: tls_helpers: fail if a listener is non existent [deployment-charts] - https://gerrit.wikimedia.org/r/755527 (https://phabricator.wikimedia.org/T291959) [15:35:01] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [15:35:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:08] (CR) Jsn.sherman: [C: +1] "Looks good to me as well!" 
[mediawiki-config] - https://gerrit.wikimedia.org/r/755792 (https://phabricator.wikimedia.org/T297628) (owner: Eigyan) [15:36:21] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [15:36:22] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [15:36:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:32] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [15:37:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:24] (CR) MarcoAurelio: "This change is ready for review." [mediawiki-config] - https://gerrit.wikimedia.org/r/756572 (owner: MarcoAurelio) [15:38:46] (PS2) MarcoAurelio: incubatorwiki: Increase AbuseFilter thresholds [mediawiki-config] - https://gerrit.wikimedia.org/r/756572 (https://phabricator.wikimedia.org/T299868) [15:38:53] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin1001 is CRITICAL: CRITICAL: the following (5) node(s) change every puppet run: wdqs1010, build2001, miscweb1002, restbase2011, labstore1006 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes [15:40:41] RECOVERY - Cassandra instance data free space on restbase2010 is OK: DISK OK - free space: /srv/cassandra/instance-data 14212 MB (40% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [15:40:54] (PS2) JMeybohm: Make a bundle signer return it's root CA [software/cfssl-issuer] - https://gerrit.wikimedia.org/r/756546 (https://phabricator.wikimedia.org/T299906) [15:40:56] (PS1) JMeybohm: Add ca to multirootca.conf in simple-cfssl [software/cfssl-issuer] - https://gerrit.wikimedia.org/r/756616 (https://phabricator.wikimedia.org/T299906) [15:44:56] (PS2) Jbond: P:base::firewall: block 
abuse_networks by default [puppet] - https://gerrit.wikimedia.org/r/756611 [15:45:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P19066 and previous config saved to /var/cache/conftool/dbconfig/20220124-154531-marostegui.json [15:45:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:39] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33405/console" [puppet] - https://gerrit.wikimedia.org/r/756611 (owner: Jbond) [15:46:24] (CR) Jbond: [V: +1 C: +2] O:cacheing-proxy: add support for structured logs [puppet] - https://gerrit.wikimedia.org/r/756608 (owner: Jbond) [15:46:50] (CR) JMeybohm: Make a bundle signer return it's root CA (5 comments) [software/cfssl-issuer] - https://gerrit.wikimedia.org/r/756546 (https://phabricator.wikimedia.org/T299906) (owner: JMeybohm) [15:47:57] (CR) CDanis: [C: +1] "+1 but probably associate this with T270618 ?" 
[puppet] - https://gerrit.wikimedia.org/r/756611 (owner: Jbond) [15:47:59] PROBLEM - Cassandra instance data free space on restbase2012 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 6900 MB (19% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [15:48:35] (PS3) Jbond: P:base::firewall: block abuse_networks by default [puppet] - https://gerrit.wikimedia.org/r/756611 (https://phabricator.wikimedia.org/T270618) [15:48:40] !log ladsgroup@deploy1002 Synchronized php-1.38.0-wmf.18/extensions/AbuseFilter/includes/ServiceWiring.php: Backport: [[gerrit:756083|Use MainStash instead of db-replicated (T272512)]] (duration: 00m 49s) [15:48:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:44] T272512: Apply outstanding schema changes for "objectcache" tables in production (exptime, flags, modtoken) - https://phabricator.wikimedia.org/T272512 [15:48:52] (CR) Jbond: P:base::firewall: block abuse_networks by default (1 comment) [puppet] - https://gerrit.wikimedia.org/r/756611 (https://phabricator.wikimedia.org/T270618) (owner: Jbond) [15:49:09] (PS1) Jbond: install1003: enable structured logs [puppet] - https://gerrit.wikimedia.org/r/756617 [15:49:20] (CR) Jbond: [C: +2] P:base::firewall: block abuse_networks by default [puppet] - https://gerrit.wikimedia.org/r/756611 (https://phabricator.wikimedia.org/T270618) (owner: Jbond) [15:49:48] !log enable abuse_network blocking globally gerrit:756611 [15:49:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:26] SRE, SRE-OnFire, Infrastructure-Foundations, Patch-For-Review: "User-reported connectivity errors" (NEL data) not being posted to statuspage since 1 Jan 00:00 UTC - https://phabricator.wikimedia.org/T298619 (CDanis) [15:53:31] SRE, SRE-OnFire, observability, Patch-For-Review, User-jbond: Automated uploads of minimal & comprehensible timeseries metrics for 
statuspage display - https://phabricator.wikimedia.org/T285569 (CDanis) [15:53:43] (CR) Elukey: [C: +2] api-gateway: allow TLS conns to PKI based TLS backends [deployment-charts] - https://gerrit.wikimedia.org/r/756524 (https://phabricator.wikimedia.org/T299550) (owner: Elukey) [15:53:52] (PS3) JMeybohm: Make a bundle signer return it's root CA [software/cfssl-issuer] - https://gerrit.wikimedia.org/r/756546 (https://phabricator.wikimedia.org/T299906) [15:54:21] (CR) JHathaway: [C: +1] "looks good" [puppet] - https://gerrit.wikimedia.org/r/756611 (https://phabricator.wikimedia.org/T270618) (owner: Jbond) [15:57:20] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): (Need By: TBD) rack/setup/install clouddumps100[12] - https://phabricator.wikimedia.org/T299610 (Andrew) [16:00:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P19067 and previous config saved to /var/cache/conftool/dbconfig/20220124-160035-marostegui.json [16:00:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:03] (CR) Herron: [C: +1] prometheus: disable rsync where not needed [puppet] - https://gerrit.wikimedia.org/r/756607 (https://phabricator.wikimedia.org/T296199) (owner: Filippo Giunchedi) [16:01:16] (CR) Herron: [C: +1] site: add Prometheus role to eqiad hardware [puppet] - https://gerrit.wikimedia.org/r/756604 (https://phabricator.wikimedia.org/T296199) (owner: Filippo Giunchedi) [16:04:41] (CR) AOkoth: kuberenetes: disable mwautopull timer (1 comment) [puppet] - https://gerrit.wikimedia.org/r/754960 (https://phabricator.wikimedia.org/T288345) (owner: AOkoth) [16:05:29] PROBLEM - Cassandra instance data free space on restbase2018 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7998 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [16:07:45] 
PROBLEM - Check systemd state on ms-be2047 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:09:35] SRE, SRE-Access-Requests, Research: Access to analytics-privatedata-users for Research intern AniketArs - https://phabricator.wikimedia.org/T299919 (AniketArs) Hi @Miriam My wikitech username: AniketArs Preferred shell username: aniketars SSh public key: ` b3BlbnNzaC1rZXktdjEAAAAACmFlczI1Ni1jdHIAAA... [16:12:08] (CR) Joal: [C: +1] "Let's have this moving then - not sure how to proceed" [puppet] - https://gerrit.wikimedia.org/r/755435 (https://phabricator.wikimedia.org/T299401) (owner: Phuedx) [16:12:38] (CR) Jbond: [C: +2] install1003: enable structured logs [puppet] - https://gerrit.wikimedia.org/r/756617 (owner: Jbond) [16:13:57] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be2047 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [16:13:59] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [16:14:09] (PS1) Jdlrobson: Respect useskin when operating in MigrationMode [skins/Vector] (wmf/1.38.0-wmf.18) - https://gerrit.wikimedia.org/r/756573 (https://phabricator.wikimedia.org/T299171) [16:14:27] PROBLEM - Cassandra instance data free space on restbase2018 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 8232 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [16:14:39] PROBLEM - Cassandra instance data free space on restbase2010 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7419 MB (20% inode=99%): 
https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [16:15:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T285149)', diff saved to https://phabricator.wikimedia.org/P19068 and previous config saved to /var/cache/conftool/dbconfig/20220124-161540-marostegui.json [16:15:43] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1101.eqiad.wmnet with reason: Maintenance [16:15:45] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1101.eqiad.wmnet with reason: Maintenance [16:15:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:49] T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149 [16:15:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1101:3318 (T285149)', diff saved to https://phabricator.wikimedia.org/P19069 and previous config saved to /var/cache/conftool/dbconfig/20220124-161549-marostegui.json [16:15:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:59] RECOVERY - cassandra-a service on restbase2011 is OK: OK - cassandra-a is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:16:38] SRE, ops-eqiad, DC-Ops: Install OpenGear console server (SCS) in new Eqiad cage - https://phabricator.wikimedia.org/T299759 (Jclark-ctr) [16:17:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3318 (T285149)', diff saved to https://phabricator.wikimedia.org/P19070 and previous config saved to /var/cache/conftool/dbconfig/20220124-161757-marostegui.json [16:18:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:14] (CR) Elukey: 
profile::cache::kafka::webrequest: Log Sec-CH-UA* headers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/755435 (https://phabricator.wikimedia.org/T299401) (owner: Phuedx) [16:21:19] PROBLEM - Cassandra instance data free space on restbase2010 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7299 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [16:22:43] PROBLEM - cassandra-a service on restbase2011 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:25:08] !log elukey@deploy1002 helmfile [staging] START helmfile.d/services/api-gateway: sync on staging [16:25:10] !log elukey@deploy1002 helmfile [staging] DONE helmfile.d/services/api-gateway: sync on production [16:25:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:53] (PS1) Jdlrobson: Enable migration mode on euwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/756621 (https://phabricator.wikimedia.org/T299927) [16:27:51] RECOVERY - Cassandra instance data free space on restbase2012 is OK: DISK OK - free space: /srv/cassandra/instance-data 13257 MB (37% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [16:28:15] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on restbase2011.codfw.wmnet with reason: bad disk [16:28:17] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on restbase2011.codfw.wmnet with reason: bad disk [16:28:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:07] SRE, SRE-Access-Requests, Fundraising-Backlog, observability, serviceops-radar: Fundraising-Tech engineers unable to ACK 
icinga alerts on fr-tech host groups - https://phabricator.wikimedia.org/T298649 (Volans) @jgleeson @Ejegg is there anything else to do here or we can consider this done fo... [16:30:04] jan_drewniak: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Wikimedia Portals Update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220124T1630). [16:30:13] (PS2) Dzahn: gerrit: set sshd.enableChannelIdTracking=false [puppet] - https://gerrit.wikimedia.org/r/755968 (https://phabricator.wikimedia.org/T263293) (owner: Hashar) [16:31:44] (CR) Dzahn: [C: +2] "https://bugs.chromium.org/p/gerrit/issues/detail?id=11491 seems to confirm this fixed it for others" [puppet] - https://gerrit.wikimedia.org/r/755968 (https://phabricator.wikimedia.org/T263293) (owner: Hashar) [16:31:50] (PS1) Jdrewniak: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/756622 (https://phabricator.wikimedia.org/T128546) [16:32:25] RECOVERY - Check systemd state on ms-be2047 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:32:35] PROBLEM - Cassandra instance data free space on restbase2010 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7098 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [16:33:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3318', diff saved to https://phabricator.wikimedia.org/P19071 and previous config saved to /var/cache/conftool/dbconfig/20220124-163302-marostegui.json [16:33:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:23] jouncebot nowandnext [16:33:23] For the next 0 hour(s) and 26 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220124T1630) [16:33:23] In 1 hour(s) and 26 minute(s): Wikidata Query Service weekly 
deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220124T1800) [16:34:36] (CR) Jdrewniak: [C: +2] Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/756622 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak) [16:34:56] (CR) Andrew Bogott: [C: +2] P:openstack::base::keystone::fernet_keys: drop use of '*' in rsync (1 comment) [puppet] - https://gerrit.wikimedia.org/r/755396 (https://phabricator.wikimedia.org/T299519) (owner: Jbond) [16:35:12] !log elukey@deploy1002 helmfile [staging] DONE helmfile.d/services/api-gateway: sync on staging [16:35:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:38:20] (Merged) jenkins-bot: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/756622 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak) [16:38:39] (CR) Dzahn: "Merged and deployed. Have not done a hard restart though." [puppet] - https://gerrit.wikimedia.org/r/755968 (https://phabricator.wikimedia.org/T263293) (owner: Hashar) [16:39:07] SRE, SRE-Access-Requests, Fundraising-Backlog, observability, serviceops-radar: Fundraising-Tech engineers unable to ACK icinga alerts on fr-tech host groups - https://phabricator.wikimedia.org/T298649 (jgleeson) hey @Volans, thanks for the reminder. Let's hold out until @ejegg has had a chan... 
[16:39:17] PROBLEM - Cassandra instance data free space on restbase2010 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7099 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [16:41:10] (CR) AOkoth: [C: +2] cumin: rename OTRS alias to VRTS after role rename [puppet] - https://gerrit.wikimedia.org/r/755473 (https://phabricator.wikimedia.org/T293942) (owner: Dzahn) [16:42:18] SRE, SRE-Access-Requests, Research: Access to analytics-privatedata-users for Research intern AniketArs - https://phabricator.wikimedia.org/T299919 (Volans) [16:42:21] !log jdrewniak@deploy1002 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:756622| Bumping portals to master (T128546)]] (duration: 00m 50s) [16:42:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:25] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [16:42:25] (CR) AOkoth: "Hey Moritz," [puppet] - https://gerrit.wikimedia.org/r/756529 (https://phabricator.wikimedia.org/T293942) (owner: Muehlenhoff) [16:43:01] (PS1) Jbond: squid: update cee format [puppet] - https://gerrit.wikimedia.org/r/756625 [16:43:11] !log jdrewniak@deploy1002 Synchronized portals: Wikimedia Portals Update: [[gerrit:756622| Bumping portals to master (T128546)]] (duration: 00m 49s) [16:43:12] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [16:43:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:45:09] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be2047 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [16:45:33] (CR) Muehlenhoff: Also rename otrs Cumin alias to vrts (1 
comment) [puppet] - https://gerrit.wikimedia.org/r/756529 (https://phabricator.wikimedia.org/T293942) (owner: Muehlenhoff) [16:45:51] (Abandoned) Muehlenhoff: Also rename otrs Cumin alias to vrts [puppet] - https://gerrit.wikimedia.org/r/756529 (https://phabricator.wikimedia.org/T293942) (owner: Muehlenhoff) [16:46:14] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [16:46:15] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [16:46:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:21] (CR) Jbond: [C: +2] squid: update cee format [puppet] - https://gerrit.wikimedia.org/r/756625 (owner: Jbond) [16:47:24] (PS1) BBlack: drmrs: connect lvs bgp to switches [puppet] - https://gerrit.wikimedia.org/r/756627 (https://phabricator.wikimedia.org/T282787) [16:47:27] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [16:47:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:46] arnoldokoth happy for me to merge your change? 
[16:48:03] RECOVERY - Cassandra instance data free space on restbase2018 is OK: DISK OK - free space: /srv/cassandra/instance-data 26552 MB (66% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [16:48:05] !log Running nodetool removenode for restbase2011-a [16:48:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3318', diff saved to https://phabricator.wikimedia.org/P19072 and previous config saved to /var/cache/conftool/dbconfig/20220124-164807-marostegui.json [16:48:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:15] RECOVERY - Cassandra instance data free space on restbase2010 is OK: DISK OK - free space: /srv/cassandra/instance-data 21977 MB (62% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [16:48:31] arnoldokoth: i have merged seemed harmless enough [16:49:21] jbond: Yeah, no worries. Thanks. [16:49:30] np [16:51:42] (PS1) Elukey: helmfile.d: add new image value for api-gateway staging env [deployment-charts] - https://gerrit.wikimedia.org/r/756628 (https://phabricator.wikimedia.org/T299550) [16:52:11] (CR) Hnowlan: [C: +1] helmfile.d: add new image value for api-gateway staging env [deployment-charts] - https://gerrit.wikimedia.org/r/756628 (https://phabricator.wikimedia.org/T299550) (owner: Elukey) [16:52:14] (CR) Dzahn: "this compiles fine in eqiad but not in codfw (there is no cloud in codfw though)" [puppet] - https://gerrit.wikimedia.org/r/754063 (https://phabricator.wikimedia.org/T297411) (owner: Jelto) [16:57:08] (CR) Dzahn: [C: -1] "see inline comments for why" [puppet] - https://gerrit.wikimedia.org/r/754063 (https://phabricator.wikimedia.org/T297411) (owner: Jelto) [16:58:35] (PS1) Jbond: ecs: post-review [puppet] - https://gerrit.wikimedia.org/r/756630 [16:59:32] o/ elukey. 
Thanks for the review/notes on https://gerrit.wikimedia.org/r/c/operations/puppet/+/755435. Any idea that change and the other will go out and who will be doing it? [17:00:00] (CR) Jbond: ecs: post-review (2 comments) [puppet] - https://gerrit.wikimedia.org/r/756630 (owner: Jbond) [17:01:44] (PS2) Jbond: ecs: post-review [puppet] - https://gerrit.wikimedia.org/r/756630 [17:02:18] (CR) BBlack: "Should this be using the public asw addresses like I have here in PS1? Or the private gateways 10.136.[01].1 like we've done in the BIRD c" [puppet] - https://gerrit.wikimedia.org/r/756627 (https://phabricator.wikimedia.org/T282787) (owner: BBlack) [17:03:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3318 (T285149)', diff saved to https://phabricator.wikimedia.org/P19074 and previous config saved to /var/cache/conftool/dbconfig/20220124-170312-marostegui.json [17:03:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:18] T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149 [17:03:42] phuedx: o/ np! I think during the coming days, will defer to Joseph/Ben's schedule (but I'll help in case needed!) 
[17:04:07] (CR) Elukey: [C: +2] helmfile.d: add new image value for api-gateway staging env [deployment-charts] - https://gerrit.wikimedia.org/r/756628 (https://phabricator.wikimedia.org/T299550) (owner: Elukey) [17:04:46] elukey: <3 Thanks :) [17:07:06] (PS2) BBlack: drmrs: connect lvs bgp to switches [puppet] - https://gerrit.wikimedia.org/r/756627 (https://phabricator.wikimedia.org/T282787) [17:09:29] PROBLEM - Check systemd state on ms-be2030 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:16:37] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be2030 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [17:17:20] SRE, Privacy Engineering, Traffic-Icebox, HTTPS, Security: Investigate our mitigation strategy for HTTPS response length attacks - https://phabricator.wikimedia.org/T92298 (JFishback_WMF) [17:18:16] (CR) Hnowlan: maps: add cassandra toggle, disable cassandra on maps hosts (2 comments) [puppet] - https://gerrit.wikimedia.org/r/753057 (https://phabricator.wikimedia.org/T298246) (owner: Hnowlan) [17:24:32] !log elukey@deploy1002 helmfile [staging] START helmfile.d/services/api-gateway: sync on staging [17:24:33] !log elukey@deploy1002 helmfile [staging] DONE helmfile.d/services/api-gateway: sync on production [17:24:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:57] !log elukey@deploy1002 helmfile [staging] DONE helmfile.d/services/api-gateway: sync on staging [17:24:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:28] (PS1) AGueyte: Update Event Stream [mediawiki-config] - https://gerrit.wikimedia.org/r/756635 
(https://phabricator.wikimedia.org/T296415) [17:26:03] (CR) jerkins-bot: [V: -1] Update Event Stream [mediawiki-config] - https://gerrit.wikimedia.org/r/756635 (https://phabricator.wikimedia.org/T296415) (owner: AGueyte) [17:28:43] (PS2) AGueyte: Update Event Stream [mediawiki-config] - https://gerrit.wikimedia.org/r/756635 (https://phabricator.wikimedia.org/T296415) [17:29:52] (CR) jerkins-bot: [V: -1] Update Event Stream [mediawiki-config] - https://gerrit.wikimedia.org/r/756635 (https://phabricator.wikimedia.org/T296415) (owner: AGueyte) [17:34:13] RECOVERY - Check systemd state on ms-be2030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:36:57] (PS3) AGueyte: WIP: Update Event Stream for IPInfo events [mediawiki-config] - https://gerrit.wikimedia.org/r/756635 (https://phabricator.wikimedia.org/T296415) [17:40:41] PROBLEM - PyBal BGP sessions are established on lvs6001 is CRITICAL: 0 le 0 https://wikitech.wikimedia.org/wiki/PyBal https://grafana.wikimedia.org/dashboard/db/pybal-bgp?var-datasource=drmrs+prometheus/ops [17:47:36] SRE, ops-codfw, ops-eqiad: Installation issues on PowerEdge R440 Restbase servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299652 (Cmjohnson) Can i just update the nic firmware or does this need scheduled downtime? 
[17:47:53] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be2030 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [17:49:15] SRE, ops-eqiad, DBA: es1022 troubles with PXE - https://phabricator.wikimedia.org/T299123 (Cmjohnson) Marostegui this actually seems like a script issue, you may want to ping @Volans [17:49:32] (PS2) MMandere: site: Add drmrs ncredir host [puppet] - https://gerrit.wikimedia.org/r/756613 (https://phabricator.wikimedia.org/T282787) [17:50:42] !log updating firmware on ganeti1013 T299527 [17:50:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:50:46] T299527: Installation issues on PowerEdge R440 eqiad Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299527 [17:51:26] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations, netops: Q2:(Need By: TBD) Rows E/F network racking task - https://phabricator.wikimedia.org/T292095 (cmooney) @Jclark-ctr I think we can assign the CR ports like this: ` cr1-eqiad et-1/0/2 ----> lsw1-e1-eqiad et-0/0/48 cr2-eqiad et-1/0/2... 
[17:51:31] PROBLEM - Cassandra instance data free space on restbase2012 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 6901 MB (19% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [17:53:51] RECOVERY - Cassandra instance data free space on restbase2012 is OK: DISK OK - free space: /srv/cassandra/instance-data 10976 MB (31% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [17:53:59] (PS3) BBlack: drmrs host bgp fixups [puppet] - https://gerrit.wikimedia.org/r/756627 (https://phabricator.wikimedia.org/T282787) [17:55:39] I’ll quickly test a patch on deployment/mwdebug, shouldn’t take too long [17:57:11] (CR) Ayounsi: [C: +1] drmrs host bgp fixups [puppet] - https://gerrit.wikimedia.org/r/756627 (https://phabricator.wikimedia.org/T282787) (owner: BBlack) [17:59:21] (CR) BBlack: [C: +2] drmrs host bgp fixups [puppet] - https://gerrit.wikimedia.org/r/756627 (https://phabricator.wikimedia.org/T282787) (owner: BBlack) [18:00:05] ryankemper: Your horoscope predicts another unfortunate Wikidata Query Service weekly deploy deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220124T1800). [18:02:01] (I’m done for now) [18:03:57] RECOVERY - PyBal BGP sessions are established on lvs6001 is OK: (C)0 le (W)0 le 1 https://wikitech.wikimedia.org/wiki/PyBal https://grafana.wikimedia.org/dashboard/db/pybal-bgp?var-datasource=drmrs+prometheus/ops [18:04:05] RECOVERY - PyBal BGP sessions are established on lvs6003 is OK: (C)0 le (W)0 le 1 https://wikitech.wikimedia.org/wiki/PyBal https://grafana.wikimedia.org/dashboard/db/pybal-bgp?var-datasource=drmrs+prometheus/ops [18:05:17] SRE, ops-eqiad: Installation issues on PowerEdge R440 eqiad Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299527 (Cmjohnson) [18:06:56] (CR) BBlack: [C: +1] "LGTM, Thanks!" 
[puppet] - https://gerrit.wikimedia.org/r/756613 (https://phabricator.wikimedia.org/T282787) (owner: MMandere) [18:08:22] SRE, ops-eqiad: Installation issues on PowerEdge R440 eqiad Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299527 (Cmjohnson) @MoritzMuehlenhoff 1013 is finished, ganeti1014 will need me to do a hard power cycle, I will be able to get to that a little later today. [18:11:14] SRE, SRE-Access-Requests, Fundraising-Backlog, observability, serviceops-radar: Fundraising-Tech engineers unable to ACK icinga alerts on fr-tech host groups - https://phabricator.wikimedia.org/T298649 (Ejegg) Thanks @Volans! I was out sick last week, but today I was able to ack a test alert... [18:17:15] (PS1) Bartosz Dziewoński: Fix showing caption and alt text fields in media and gallery dialogs [extensions/VisualEditor] (wmf/1.38.0-wmf.18) - https://gerrit.wikimedia.org/r/756583 (https://phabricator.wikimedia.org/T299818) [18:17:33] (PS1) BBlack: bird anycast: fix defaulting to local gateway [puppet] - https://gerrit.wikimedia.org/r/756639 (https://phabricator.wikimedia.org/T282787) [18:18:46] RECOVERY - Host upload-lb.drmrs.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 86.18 ms [18:18:47] RECOVERY - Host text-lb.drmrs.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 86.13 ms [18:18:48] RECOVERY - Host text-lb.drmrs.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 86.17 ms [18:18:57] RECOVERY - Host upload-lb.drmrs.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 88.26 ms [18:19:05] (PS1) Bartosz Dziewoński: Revert "Follow-up I0802440d9: Allow alien /'s to be focused" [VisualEditor/VisualEditor] (wmf/1.38.0-wmf.18) - https://gerrit.wikimedia.org/r/756584 (https://phabricator.wikimedia.org/T298609) [18:22:35] !log razzi@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on an-test-coord1001.eqiad.wmnet with reason: Unmounting /srv to try to repair 
the filesystem [18:22:37] !log razzi@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on an-test-coord1001.eqiad.wmnet with reason: Unmounting /srv to try to repair the filesystem [18:22:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:22:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:28] (CR) Bartosz Dziewoński: "I don't remember whether merging this will automatically update the submodule in mediawiki/extensions/VisualEditor, or whether that has to" [VisualEditor/VisualEditor] (wmf/1.38.0-wmf.18) - https://gerrit.wikimedia.org/r/756584 (https://phabricator.wikimedia.org/T298609) (owner: Bartosz Dziewoński) [18:26:36] SRE, ops-eqiad, decommission-hardware, Patch-For-Review, cloud-services-team (Hardware): cloudmetrics1003 seizes up under load - https://phabricator.wikimedia.org/T297814 (Cmjohnson) I've created a ticket with Dell Self-Dispatch for a new CPU. I am doubtful they will send me one since I can n... 
[18:26:46] (PS2) BBlack: bird anycast: fix defaulting to local gateway [puppet] - https://gerrit.wikimedia.org/r/756639 (https://phabricator.wikimedia.org/T282787) [18:28:30] (CR) Brennen Bearnes: Revert "Follow-up I0802440d9: Allow alien /'s to be focused" (1 comment) [VisualEditor/VisualEditor] (wmf/1.38.0-wmf.18) - https://gerrit.wikimedia.org/r/756584 (https://phabricator.wikimedia.org/T298609) (owner: Bartosz Dziewoński) [18:29:41] (CR) Ayounsi: [C: +1] "https://puppet-compiler.wmflabs.org/pcc-worker1003/33409/" [puppet] - https://gerrit.wikimedia.org/r/756639 (https://phabricator.wikimedia.org/T282787) (owner: BBlack) [18:30:01] (CR) BBlack: [C: +2] bird anycast: fix defaulting to local gateway [puppet] - https://gerrit.wikimedia.org/r/756639 (https://phabricator.wikimedia.org/T282787) (owner: BBlack) [18:33:24] (CR) Brennen Bearnes: Revert "Follow-up I0802440d9: Allow alien /'s to be focused" (1 comment) [VisualEditor/VisualEditor] (wmf/1.38.0-wmf.18) - https://gerrit.wikimedia.org/r/756584 (https://phabricator.wikimedia.org/T298609) (owner: Bartosz Dziewoński) [18:34:32] (CR) Klausman: [C: +1] ml-services: add draftquality transformer [deployment-charts] - https://gerrit.wikimedia.org/r/756064 (https://phabricator.wikimedia.org/T298989) (owner: Accraze) [18:34:46] (CR) jerkins-bot: [V: -1] Fix showing caption and alt text fields in media and gallery dialogs [extensions/VisualEditor] (wmf/1.38.0-wmf.18) - https://gerrit.wikimedia.org/r/756583 (https://phabricator.wikimedia.org/T299818) (owner: Bartosz Dziewoński) [18:35:51] (CR) Jforrester: Revert "Follow-up I0802440d9: Allow alien /'s to be focused" (1 comment) [VisualEditor/VisualEditor] (wmf/1.38.0-wmf.18) - https://gerrit.wikimedia.org/r/756584 (https://phabricator.wikimedia.org/T298609) (owner: Bartosz Dziewoński) [18:41:07] brennen: oh sorry, i didn't notice you were already fixing the task dependencies [18:42:18] MatmaRex: no 
worries, you just moved faster than me. :) [18:57:18] (PS1) Andrew Bogott: cinder.conf: use zstd to compress backups [puppet] - https://gerrit.wikimedia.org/r/756642 [18:58:03] (CR) Phuedx: [C: -1] "Thanks for taking this on 😊 See inline for a note on where to put this code." [mediawiki-config] - https://gerrit.wikimedia.org/r/756635 (https://phabricator.wikimedia.org/T296415) (owner: AGueyte) [18:58:50] (PS1) Bking: deployment-prep: add 3 elastic nodes in preparation for bullseye upgrade [puppet] - https://gerrit.wikimedia.org/r/756643 (https://phabricator.wikimedia.org/T299797) [18:59:33] (PS1) 4nn1l2: commonswiki: Add ala-images.s3.ap-southeast-2.amazonaws.com to the wgCopyUploadsDomains allowlist [mediawiki-config] - https://gerrit.wikimedia.org/r/756644 (https://phabricator.wikimedia.org/T299825) [19:00:02] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [19:00:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:05] RoanKattouw and Urbanecm: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC evening backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220124T1900). [19:00:05] Juan_90264, Jdlrobson, dancy, and MatmaRex: A patch you scheduled for UTC evening backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [19:00:11] hi [19:00:20] o/ [19:00:22] wow, a lot of patches [19:00:25] hello [19:00:35] dancy: want to lead the window or should I? [19:00:38] (i'm still in a meeting so i can go last) [19:00:45] ack [19:00:53] urbanecm: Take lead please. [19:00:56] okay [19:01:00] I can deploy today then :) [19:01:18] skipping Juan's patches, as he's not here apparently? 
[19:01:45] (CR) Urbanecm: [C: +2] Respect useskin when operating in MigrationMode [skins/Vector] (wmf/1.38.0-wmf.18) - https://gerrit.wikimedia.org/r/756573 (https://phabricator.wikimedia.org/T299171) (owner: Jdlrobson) [19:01:54] (CR) Andrew Bogott: [C: +2] cinder.conf: use zstd to compress backups [puppet] - https://gerrit.wikimedia.org/r/756642 (owner: Andrew Bogott) [19:01:59] Jdlrobson: hi, are you around? [19:02:24] nn1l2: hi, you said "hi", but i don't see any patches from you. Am I missing some? [19:02:27] Hello [19:02:36] I will add it now [19:02:39] hello Juan_90264 [19:02:44] hello [19:02:46] present! [19:02:50] sorry, internet connection problem [19:02:59] (PS6) Urbanecm: Create Draft namespace for bgwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/755413 (https://phabricator.wikimedia.org/T299224) (owner: Juan90264) [19:03:03] (CR) Urbanecm: [C: +2] Create Draft namespace for bgwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/755413 (https://phabricator.wikimedia.org/T299224) (owner: Juan90264) [19:03:11] hello Jdlrobson ! 
Great [19:03:18] (CR) Urbanecm: [C: +2] Respect useskin when operating in MigrationMode [skins/Vector] (wmf/1.38.0-wmf.18) - https://gerrit.wikimedia.org/r/756573 (https://phabricator.wikimedia.org/T299171) (owner: Jdlrobson) [19:03:24] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:03:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:03:50] (CR) Majavah: [C: -1] deployment-prep: add 3 elastic nodes in preparation for bullseye upgrade (1 comment) [puppet] - https://gerrit.wikimedia.org/r/756643 (https://phabricator.wikimedia.org/T299797) (owner: Bking) [19:04:13] (Merged) jenkins-bot: Create Draft namespace for bgwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/755413 (https://phabricator.wikimedia.org/T299224) (owner: Juan90264) [19:04:16] (i'm ready now, whenever it's my turn) [19:04:31] Great merged! [19:04:33] (CR) Urbanecm: [C: +2] Revert "Follow-up I0802440d9: Allow alien /'s to be focused" [VisualEditor/VisualEditor] (wmf/1.38.0-wmf.18) - https://gerrit.wikimedia.org/r/756584 (https://phabricator.wikimedia.org/T298609) (owner: Bartosz Dziewoński) [19:04:45] done/added now [19:04:49] mwdebug1001 or 1002? [19:05:00] MatmaRex: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/VisualEditor/+/756583 has a -1 from jenkins, can you look? 
[19:05:08] it's selenium, but...checking before i rerun it anyway [19:05:10] looking [19:05:22] Juan_90264: none yet, i have to pull it first [19:05:39] Okay [19:05:42] Juan_90264: pulled at mwdebug1001, please have a look [19:05:58] (CR) Bartosz Dziewoński: "recheck" [extensions/VisualEditor] (wmf/1.38.0-wmf.18) - https://gerrit.wikimedia.org/r/756583 (https://phabricator.wikimedia.org/T299818) (owner: Bartosz Dziewoński) [19:05:58] Ok [19:06:24] (CR) Bartosz Dziewoński: "Failure looks unrelated:" [extensions/VisualEditor] (wmf/1.38.0-wmf.18) - https://gerrit.wikimedia.org/r/756583 (https://phabricator.wikimedia.org/T299818) (owner: Bartosz Dziewoński) [19:08:02] (CR) Urbanecm: [C: +2] "MatmaRex says failure is unrelated => let's start gate-and-submit and hope it'll merge" [extensions/VisualEditor] (wmf/1.38.0-wmf.18) - https://gerrit.wikimedia.org/r/756583 (https://phabricator.wikimedia.org/T299818) (owner: Bartosz Dziewoński) [19:08:11] MatmaRex: since you say it's an unrelated failure, started the gate-and-submit jobs too [19:08:17] urbanecm: I tested and approved! 
[19:08:18] cool [19:08:41] Juan_90264: syncing [19:08:49] (PS4) Urbanecm: Disable RelatedArticles on ptwikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/756081 (https://phabricator.wikimedia.org/T299873) (owner: Juan90264) [19:08:53] (CR) Urbanecm: [C: +2] Disable RelatedArticles on ptwikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/756081 (https://phabricator.wikimedia.org/T299873) (owner: Juan90264) [19:09:17] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [19:09:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:27] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: df86dd42c18809357583f9d094af5a3d6f33c32d: Create Draft namespace for bgwiki (T299224) (duration: 00m 49s) [19:09:31] Jdlrobson: just checking, is the config depending on backport in some way, or can i do config before the backport? [19:09:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:32] T299224: bgwiki: Add draft namespace - https://phabricator.wikimedia.org/T299224 [19:09:36] Juan_90264: first patch live [19:09:38] !log deleted centralauth.global_user_groups for 10 non-existent users T299650 [19:09:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:42] T299650: Ghost global accounts having global permissions - https://phabricator.wikimedia.org/T299650 [19:10:12] (CR) Herron: [C: +1] site: add Prometheus role to codfw hardware [puppet] - https://gerrit.wikimedia.org/r/756603 (https://phabricator.wikimedia.org/T296199) (owner: Filippo Giunchedi) [19:10:31] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [19:10:32] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [19:10:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:35] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:41] urbanecm: config is blocked on backport [19:10:43] Urbanecm: Okay [19:10:50] can do at same time though as well [19:10:52] if that's preferable [19:11:08] (Merged) jenkins-bot: Revert "Follow-up I0802440d9: Allow alien /'s to be focused" [VisualEditor/VisualEditor] (wmf/1.38.0-wmf.18) - https://gerrit.wikimedia.org/r/756584 (https://phabricator.wikimedia.org/T298609) (owner: Bartosz Dziewoński) [19:11:12] I can't really test the backport without the config change [19:11:20] Jdlrobson: ack, thanks. Will do them at once then. [19:11:28] the config change however needs to go out when the backport is there [19:11:41] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [19:11:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:11:52] (Merged) jenkins-bot: Disable RelatedArticles on ptwikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/756081 (https://phabricator.wikimedia.org/T299873) (owner: Juan90264) [19:12:02] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudvirt1047.eqiad.wmnet - https://phabricator.wikimedia.org/T293391 (Cmjohnson) updated the second port on cloudsw2-d5 [19:12:17] Juan_90264: your second patch is at mwdebug1001, please test. [19:12:45] Yes, I will test [19:12:49] (PS2) Bking: deployment-prep: add 3 elastic nodes in preparation for bullseye upgrade [puppet] - https://gerrit.wikimedia.org/r/756643 (https://phabricator.wikimedia.org/T299797) [19:13:03] MatmaRex: can you please send the submodule update for https://gerrit.wikimedia.org/r/c/VisualEditor/VisualEditor/+/756584/ too, so i can get it to the deployment host? 
[19:13:34] (CR) jerkins-bot: [V: -1] deployment-prep: add 3 elastic nodes in preparation for bullseye upgrade [puppet] - https://gerrit.wikimedia.org/r/756643 (https://phabricator.wikimedia.org/T299797) (owner: Bking) [19:14:53] urbanecm: yes [19:14:58] thanks [19:15:10] (PS2) Urbanecm: commonswiki: Add ala-images.s3.ap-southeast-2.amazonaws.com to the wgCopyUploadsDomains allowlist [mediawiki-config] - https://gerrit.wikimedia.org/r/756644 (https://phabricator.wikimedia.org/T299825) (owner: 4nn1l2) [19:15:14] (CR) Urbanecm: [C: +2] commonswiki: Add ala-images.s3.ap-southeast-2.amazonaws.com to the wgCopyUploadsDomains allowlist [mediawiki-config] - https://gerrit.wikimedia.org/r/756644 (https://phabricator.wikimedia.org/T299825) (owner: 4nn1l2) [19:16:07] SRE, ops-eqiad, DC-Ops: Q1: eqiad: (32) PDUs for expansion - https://phabricator.wikimedia.org/T290899 (RobH) Ok, basic network and our PDU login applied to all PDUs via SCS connections, next is connection via HTTPS for full configuration of snmp services. 
[19:16:45] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [19:16:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:19] (Merged) jenkins-bot: commonswiki: Add ala-images.s3.ap-southeast-2.amazonaws.com to the wgCopyUploadsDomains allowlist [mediawiki-config] - https://gerrit.wikimedia.org/r/756644 (https://phabricator.wikimedia.org/T299825) (owner: 4nn1l2) [19:17:31] urbanecm: I tested and approved [19:17:36] thanks, syncing [19:18:49] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 2029c35b8a63e8f08b0ce5e0238c2dbee8854377: Disable RelatedArticles on ptwikinews (T299873) (duration: 00m 49s) [19:18:50] (PS1) Bartosz Dziewoński: Update VE core submodule to origin/wmf/1.38.0-wmf.18 [extensions/VisualEditor] (wmf/1.38.0-wmf.18) - https://gerrit.wikimedia.org/r/756650 (https://phabricator.wikimedia.org/T298609) [19:18:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:53] T299873: remove RelatedArticles extension on ptwikinews - https://phabricator.wikimedia.org/T299873 [19:18:58] Juan_90264: live [19:19:09] nn1l2: your patch is at mwdebug1001, please have a look [19:19:15] ok [19:19:38] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [19:19:39] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [19:19:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:19:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:23] (Merged) jenkins-bot: Respect useskin when operating in MigrationMode [skins/Vector] (wmf/1.38.0-wmf.18) - https://gerrit.wikimedia.org/r/756573 (https://phabricator.wikimedia.org/T299171) (owner: Jdlrobson) [19:20:44] (PS2) Urbanecm: Enable migration mode on euwiki [mediawiki-config] - 
https://gerrit.wikimedia.org/r/756621 (https://phabricator.wikimedia.org/T299927) (owner: Jdlrobson) [19:20:48] (CR) Urbanecm: [C: +2] Enable migration mode on euwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/756621 (https://phabricator.wikimedia.org/T299927) (owner: Jdlrobson) [19:20:53] LGTM, file uploaded by url: https://commons.wikimedia.org/wiki/File:Koala_e13e2b7c-a6a4-4836-8834-080331010a5c.jpg [19:20:57] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [19:20:58] great, syncing [19:20:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:59] w00t [19:21:35] Jdlrobson: not available yet though -- will ping you once it is [19:21:50] MatmaRex: is https://gerrit.wikimedia.org/r/c/mediawiki/extensions/VisualEditor/+/756650/ the submodule update patch? [19:22:08] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: bf2981b6200d760eef8bcc70228a0207ebf9d7cb: commonswiki: Add ala-images.s3.ap-southeast-2.amazonaws.com to the wgCopyUploadsDomains allowlist (T299825) (duration: 00m 49s) [19:22:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:12] T299825: Add images.ala.org.au to the wgCopyUploadsDomains allowlist of Wikimedia Commons - https://phabricator.wikimedia.org/T299825 [19:22:14] nn1l2: and, live :) [19:22:19] urbanecm: yes. i was wondering if i should do something about the "Localisation updates" commit that ended up in there [19:22:28] Thanks! [19:22:55] urbanecm: ack [19:22:56] urbanecm: looks like the VE/VE submodule's wmf.18 branch doesn't match the commits that are in the mw/ext/VE wmf.18 branch [19:23:17] yeah. We can try and dig deeper later if it doesn't work? [19:23:19] Both patches are working. Thanks Urbanecm! 
[19:23:25] no problem Juan_90264 [19:23:28] i guess it was just branched from the master at the time [19:23:40] likely [19:23:48] (Merged) jenkins-bot: Enable migration mode on euwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/756621 (https://phabricator.wikimedia.org/T299927) (owner: Jdlrobson) [19:23:58] urbanecm: so, if including the "Localisation updates" is a problem, then we should revert that commit in VE/VE wmf.18 first [19:24:07] if it's not a problem for the deployment, then it's fine to leave it there [19:24:26] i don't think it will be a problem [19:24:31] ok [19:24:46] Juan_90264: your backport+config is at mwdebug1001, please test [19:25:31] (CR) Urbanecm: [C: +2] Update VE core submodule to origin/wmf/1.38.0-wmf.18 [extensions/VisualEditor] (wmf/1.38.0-wmf.18) - https://gerrit.wikimedia.org/r/756650 (https://phabricator.wikimedia.org/T298609) (owner: Bartosz Dziewoński) [19:25:32] Test? [19:25:39] Juan_90264: sorry, wrong ping [19:25:46] * Jdlrobson: your backport+config is at mwdebug1001, please test [19:25:59] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [19:26:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:26:10] urbanecm Okay, no problem [19:27:04] urbanecm: looking [19:27:31] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [19:27:33] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [19:27:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:27:34] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): cloudmetrics1004 potential hardware problem - https://phabricator.wikimedia.org/T299744 (Andrew) I can easily reproduce the issue on cloudmetrics1003 by putting the post into service, e.g. with https://gerrit.wikimedia.org/r/c/operations/pup... 
[19:27:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:28:51] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [19:28:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:41] (Merged) jenkins-bot: Fix showing caption and alt text fields in media and gallery dialogs [extensions/VisualEditor] (wmf/1.38.0-wmf.18) - https://gerrit.wikimedia.org/r/756583 (https://phabricator.wikimedia.org/T299818) (owner: Bartosz Dziewoński) [19:31:50] urbanecm: okay good to sync! [19:31:55] Jdlrobson: excellent! [19:31:56] thanks! [19:32:35] MatmaRex: the first VE backport is now at mwdebug1001, please have a look [19:32:52] looking [19:33:02] SRE, ops-eqiad, decommission-hardware, Patch-For-Review, cloud-services-team (Hardware): cloudmetrics1003 seizes up under load - https://phabricator.wikimedia.org/T297814 (wiki_willy) Thanks @Cmjohnson - let's see how things go with T299744 first, before I escalate that Dell ticket up to the... 
[19:33:19] urbanecm: seems good [19:33:55] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [19:33:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:34:04] thanks MatmaRex, will sync [19:34:48] !log urbanecm@deploy1002 Synchronized php-1.38.0-wmf.18/skins/Vector/includes/Constants.php: 4f430a8: Respect useskin when operating in MigrationMode (T299171; 1/2) (duration: 00m 48s) [19:34:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:34:52] T299171: Can't view old Vector on beta cluster via query string - https://phabricator.wikimedia.org/T299171 [19:35:11] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [19:35:12] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [19:35:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:26] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [19:35:37] !log urbanecm@deploy1002 Synchronized php-1.38.0-wmf.18/skins/Vector/: 4f430a8: Respect useskin when operating in MigrationMode (T299171; 2/2) (duration: 00m 48s) [19:35:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:36:26] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 27c5ab3: Enable migration mode on euwiki (T299927) (duration: 00m 48s) [19:36:28] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [19:36:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:36:30] 
T299927: Deploy new Vector skin to all projects - https://phabricator.wikimedia.org/T299927 [19:36:30] Jdlrobson: and, live [19:36:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:36:35] urbanecm: thanks! :D [19:36:37] np [19:38:17] dancy: hey, unfortunately, I have only limited understanding about your two patches. Would you mind deploying them yourself when I'm done with all the others? [19:38:23] !log urbanecm@deploy1002 Synchronized php-1.38.0-wmf.18/extensions/VisualEditor/modules/ve-mw/ui/dialogs/: 531efd0: Fix showing caption and alt text fields in media and gallery dialogs (T299818) (duration: 00m 48s) [19:38:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:38:27] T299818: Fields for caption and alt text do not appear in media and gallery dialogs - https://phabricator.wikimedia.org/T299818 [19:38:27] MatmaRex: and, live [19:38:37] urbanecm: sure thing. Just lemme know when you're ready. [19:38:43] Will do [19:38:48] thanks. submodule thing next? [19:38:53] MatmaRex: yup. Once it merges. [19:43:48] (Merged) jenkins-bot: Update VE core submodule to origin/wmf/1.38.0-wmf.18 [extensions/VisualEditor] (wmf/1.38.0-wmf.18) - https://gerrit.wikimedia.org/r/756650 (https://phabricator.wikimedia.org/T298609) (owner: Bartosz Dziewoński) [19:43:53] here we go [19:44:27] MatmaRex: should be at mwdebug1001. Can you check? 
[19:44:36] (note the i18n change will look like a no-op) [19:45:00] urbanecm: also looks good [19:45:05] that was quick [19:45:09] let's push it out [19:45:55] (PS3) Bking: deployment-prep: add 3 elastic nodes in preparation for bullseye upgrade [puppet] - https://gerrit.wikimedia.org/r/756643 (https://phabricator.wikimedia.org/T299797) [19:46:35] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [19:46:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:03] !log urbanecm@deploy1002 Synchronized php-1.38.0-wmf.18/extensions/VisualEditor/lib/ve/: a369e0a: Revert "Follow-up I0802440d9: Allow alien / s to be focused" (deployed via e09d79d; T298609; T299730) (duration: 00m 49s) [19:47:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:08] T299730: VE: Elements placed in bulleted list (* or #) aren't editable by mouse - https://phabricator.wikimedia.org/T299730 [19:47:08] T298609: Table with generated cells explodes when clicked - https://phabricator.wikimedia.org/T298609 [19:47:09] MatmaRex: should be live [19:47:43] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [19:47:44] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [19:47:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:56] dancy: I'm done now. Over to you [19:48:03] urbanecm: thanks [19:48:04] thx [19:48:07] any time MatmaRex [19:48:28] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudvirt1047.eqiad.wmnet - https://phabricator.wikimedia.org/T293391 (Cmjohnson) the ticket was rejected, I am not sure how I can troubleshoot this for them, @wiki_willy you may want to escalate this wit... 
[19:48:42] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [19:48:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:48:57] (CR) Ahmon Dancy: [C: +2] Choose wikiversions.php file relative to MWMultiVersion.php [mediawiki-config] - https://gerrit.wikimedia.org/r/743038 (owner: Ahmon Dancy) [19:49:43] (Merged) jenkins-bot: Choose wikiversions.php file relative to MWMultiVersion.php [mediawiki-config] - https://gerrit.wikimedia.org/r/743038 (owner: Ahmon Dancy) [19:50:12] SRE-OnFire: 2021-11-02 Cloud VPS networking - https://phabricator.wikimedia.org/T299964 (herron) p:Triage→Medium [19:50:14] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): cloudmetrics1004 potential hardware problem - https://phabricator.wikimedia.org/T299744 (Cmjohnson) the ticket was rejected, I am not sure how I can troubleshoot this for them, @wiki_willy you may want to escalate this with our reps. I will... 
[19:50:41] SRE-OnFire: 2021-11-04 large file upload timeouts - https://phabricator.wikimedia.org/T299965 (herron) p:Triage→Medium [19:51:05] SRE-OnFire: 2021-11-05 TOC language converter - https://phabricator.wikimedia.org/T299966 (herron) p:Triage→Medium [19:51:08] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [19:51:33] SRE-OnFire: 2021-11-10 cirrussearch commonsfile outage - https://phabricator.wikimedia.org/T299967 (herron) p:Triage→Medium [19:51:56] SRE-OnFire: 2021-11-18 codfw ipv6 network - https://phabricator.wikimedia.org/T299968 (herron) p:Triage→Medium [19:52:05] !log dancy@deploy1002 Synchronized multiversion/MWMultiVersion.php: Config: [[gerrit:743038|Choose wikiversions.php file relative to MWMultiVersion.php]] (duration: 00m 48s) [19:52:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:52:20] SRE-OnFire: 2021-11-23 Core Network Routing - https://phabricator.wikimedia.org/T299969 (herron) p:Triage→Medium [19:52:24] (CR) Ahmon Dancy: [C: +2] MWMultiVersion.php: Reverse logic for wikiversions file selection [mediawiki-config] - https://gerrit.wikimedia.org/r/744836 (owner: Ahmon Dancy) [19:52:36] (CR) jerkins-bot: [V: -1] MWMultiVersion.php: Reverse logic for wikiversions file selection [mediawiki-config] - https://gerrit.wikimedia.org/r/744836 (owner: Ahmon Dancy) [19:52:52] (PS4) Ahmon Dancy: MWMultiVersion.php: Reverse logic for wikiversions file selection [mediawiki-config] - https://gerrit.wikimedia.org/r/744836 [19:52:56] SRE-OnFire: 2021-11-25 eventgate-main outage - https://phabricator.wikimedia.org/T299970 (herron) p:Triage→Medium [19:53:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START 
helmfile.d/services/mwdebug: apply on pinkunicorn [19:53:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:54:21] (CR) Ahmon Dancy: MWMultiVersion.php: Reverse logic for wikiversions file selection [mediawiki-config] - https://gerrit.wikimedia.org/r/744836 (owner: Ahmon Dancy) [19:54:27] (CR) Ahmon Dancy: [C: +2] MWMultiVersion.php: Reverse logic for wikiversions file selection [mediawiki-config] - https://gerrit.wikimedia.org/r/744836 (owner: Ahmon Dancy) [19:55:02] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [19:55:04] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [19:55:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:55:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:55:15] (Merged) jenkins-bot: MWMultiVersion.php: Reverse logic for wikiversions file selection [mediawiki-config] - https://gerrit.wikimedia.org/r/744836 (owner: Ahmon Dancy) [19:56:14] SRE, SRE-OnFire, Sustainability (Incident Followup): Incident: 2021-12-03 mx2001->Gmail delivery issues - https://phabricator.wikimedia.org/T297127 (herron) a:herron [19:56:16] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [19:56:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:56:23] SRE, ops-eqiad, decommission-hardware, Patch-For-Review, cloud-services-team (Hardware): cloudmetrics1003 seizes up under load - https://phabricator.wikimedia.org/T297814 (Cmjohnson) the ticket was rejected, I am not sure how I can troubleshoot this for them, @wiki_willy you may want to escal... 
[19:57:04] !log dancy@deploy1002 Synchronized multiversion/MWMultiVersion.php: Config: [[gerrit:744836|MWMultiVersion.php: Reverse logic for wikiversions file selection]] (duration: 00m 49s) [19:57:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:58:01] urbanecm: Done [19:58:07] great! [19:58:12] window is completed then [19:58:54] * dancy wipes brow [19:59:09] * urbanecm looks "brow" up in dictionary [20:00:27] Better to google "wipe brow" and check the images [20:01:04] was quite a crowded window :) [20:01:27] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [20:01:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:02:37] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [20:02:39] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [20:02:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:02:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:02:53] I wonder if we should either start enforcing or remove the "Max 6 patches" part from the calendar, currently basically no-one respects that [20:03:51] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [20:03:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:04:25] taavi: good point. I think we should remove the rule. If necessary, deployers will still be able to refuse deployment for any reason, which includes "window too crowded" [20:04:29] SRE, ops-codfw: Degraded RAID on restbase2011 - https://phabricator.wikimedia.org/T299871 (wiki_willy) a:Papaul Hi @Eevans - since the refresh for this host was just installed via T294377, are you ok if we ignore this alert and resolve the ticket? 
Thanks, Willy [20:05:50] !log jhathaway@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mx1001.wikimedia.org with reason: kernel testing [20:05:51] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mx1001.wikimedia.org with reason: kernel testing [20:05:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:05:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:44] RECOVERY - Check systemd state on mx1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:09:18] (CR) Zabe: [C: +1] incubatorwiki: Increase AbuseFilter thresholds [mediawiki-config] - https://gerrit.wikimedia.org/r/756572 (https://phabricator.wikimedia.org/T299868) (owner: MarcoAurelio) [20:11:04] PROBLEM - Cassandra instance data free space on restbase2012 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7195 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [20:18:09] (PS13) Herron: prometheus: add blackbox generic http/s static check support [puppet] - https://gerrit.wikimedia.org/r/747550 (https://phabricator.wikimedia.org/T292603) [20:20:38] (PS14) Herron: prometheus: add blackbox generic http/s static check support [puppet] - https://gerrit.wikimedia.org/r/747550 (https://phabricator.wikimedia.org/T292603) [20:22:32] (CR) Majavah: [C: -1] "oh, one more thing: deployment-elastic08 has existed at some point (https://phabricator.wikimedia.org/T147777), so we should probably skip" [puppet] - https://gerrit.wikimedia.org/r/756643 (https://phabricator.wikimedia.org/T299797) (owner: Bking) [20:23:49] (PS15) Herron: prometheus: add blackbox generic http/s static check support [puppet] - https://gerrit.wikimedia.org/r/747550 (https://phabricator.wikimedia.org/T292603) [20:25:16] (PS16) Herron: prometheus: add blackbox 
generic "watchrat" http/s static check support [puppet] - https://gerrit.wikimedia.org/r/747550 (https://phabricator.wikimedia.org/T292603) [20:26:04] (PS5) Juan90264: Enable wgMinervaEnableSiteNotice for bnwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/756585 (https://phabricator.wikimedia.org/T299529) [20:37:09] SRE-OnFire: 2021-10-29 graphite - https://phabricator.wikimedia.org/T295157 (jcrespo) a:jcrespo [20:39:37] SRE, SRE-Access-Requests, Research: Access to analytics-privatedata-users for Research intern AniketArs - https://phabricator.wikimedia.org/T299919 (jhathaway) [20:42:32] SRE, ops-eqiad, DC-Ops: Q1: eqiad: (32) PDUs for expansion - https://phabricator.wikimedia.org/T290899 (RobH) [20:43:28] (CR) Cwhite: ecs: post-review (2 comments) [puppet] - https://gerrit.wikimedia.org/r/756630 (owner: Jbond) [20:44:57] SRE, ops-eqiad, DC-Ops: Q1: eqiad: (32) PDUs for expansion - https://phabricator.wikimedia.org/T290899 (RobH) PDUS in E1-E4 are fully setup with syslog, ntp, snmp settings. E5-E8 have their network setup done, but not their root user (admn only currently), snmp, ntp, or other settings done via https... [20:45:46] RECOVERY - Cassandra instance data free space on restbase2012 is OK: DISK OK - free space: /srv/cassandra/instance-data 11209 MB (31% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [20:47:38] SRE, SRE-Access-Requests, Research: Access to analytics-privatedata-users for Research intern AniketArs - https://phabricator.wikimedia.org/T299919 (jhathaway) Hi Aniket, would you kindly provide the following: 1. Email address 2. Full reasoning for access (including what commands and/or tasks they... 
[20:47:52] (CR) Cwhite: ecs: post-review (1 comment) [puppet] - https://gerrit.wikimedia.org/r/756630 (owner: Jbond) [20:56:21] !log razzi@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on an-test-coord1001.eqiad.wmnet with reason: Unmounting /srv to try to repair the filesystem [20:56:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:56:24] !log razzi@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on an-test-coord1001.eqiad.wmnet with reason: Unmounting /srv to try to repair the filesystem [20:56:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:04] chrisalbon and accraze: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Services – Graphoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220124T2100). [21:02:03] (CR) Jgiannelos: [C: +1] maps: add cassandra toggle, disable cassandra on maps hosts [puppet] - https://gerrit.wikimedia.org/r/753057 (https://phabricator.wikimedia.org/T298246) (owner: Hnowlan) [21:03:17] SRE, SRE-Access-Requests, Fundraising-Backlog, observability, serviceops-radar: Fundraising-Tech engineers unable to ACK icinga alerts on fr-tech host groups - https://phabricator.wikimedia.org/T298649 (jgleeson) Great. @Volans42 are you able to demote my account to the level of @Ejegg's and... 
[21:06:56] (CR) Yahya: [C: +1] Enable wgMinervaEnableSiteNotice for bnwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/756585 (https://phabricator.wikimedia.org/T299529) (owner: Juan90264) [21:08:57] (PS4) Bking: deployment-prep: add 3 elastic nodes in preparation for bullseye upgrade [puppet] - https://gerrit.wikimedia.org/r/756643 (https://phabricator.wikimedia.org/T299797) [21:09:43] SRE, ops-codfw: Test Dell switches cabling - https://phabricator.wikimedia.org/T290133 (Papaul) [21:11:08] SRE, ops-codfw: Test Dell switches cabling - https://phabricator.wikimedia.org/T290133 (Papaul) [21:11:09] (CR) Herron: prometheus: add blackbox generic "watchrat" http/s static check support (1 comment) [puppet] - https://gerrit.wikimedia.org/r/747550 (https://phabricator.wikimedia.org/T292603) (owner: Herron) [21:13:03] SRE, ops-codfw: Test Dell switches cabling - https://phabricator.wikimedia.org/T290133 (Papaul) [21:13:06] (CR) Herron: ecs: post-review (1 comment) [puppet] - https://gerrit.wikimedia.org/r/756630 (owner: Jbond) [21:15:02] (CR) Cwhite: ecs: post-review (1 comment) [puppet] - https://gerrit.wikimedia.org/r/756630 (owner: Jbond) [21:18:15] urbanecm: around? [21:18:20] !log btullis@deploy1002 Started deploy [analytics/refinery@94ec386] (hadoop-test): (no justification provided) [21:18:21] Re one of the patches you deployed [21:18:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:18:23] !log btullis@deploy1002 Finished deploy [analytics/refinery@94ec386] (hadoop-test): (no justification provided) (duration: 00m 02s) [21:18:24] somewhat? 
[21:18:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:18:27] what's up [21:18:32] !log root@cumin1001 START - Cookbook sre.hosts.reimage for host cloudmetrics1003.eqiad.wmnet with OS bullseye [21:18:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:18:40] 10SRE, 10ops-eqiad, 10decommission-hardware, 10Patch-For-Review, 10cloud-services-team (Hardware): cloudmetrics1003 seizes up under load - https://phabricator.wikimedia.org/T297814 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by root@cumin1001 for host cloudmetrics1003.eqiad.wm... [21:19:03] urbanecm: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/755413/6/wmf-config/InitialiseSettings.php#5811 should have also enabled VE & disabled indexing [21:19:07] Juan_90264: ^ [21:19:35] well, it didn't [21:20:01] urbanecm: no, it seemed to have been missed off [21:20:27] at the very least, not the deployer's fault (patch does what was stated in commit msg) [21:20:43] urbanecm: no, not your fault [21:20:51] Juan_90264: are you able to push a fix? [21:20:59] * RhinosF1 did call it out in the task [21:21:15] either way, I suggest it waits for the next deployment window [21:21:23] Yeah [21:21:36] 10SRE, 10ops-eqiad, 10decommission-hardware, 10Patch-For-Review, 10cloud-services-team (Hardware): cloudmetrics1003 seizes up under load - https://phabricator.wikimedia.org/T297814 (10Andrew) I'm going to reimage this host with Bullseye and see if I can get any different behavior out of it. 
[21:32:44] 10SRE-OnFire (FY2021/2022-Q3): incidents occurring during Q2 and Q3 have been scored with the scorecard - https://phabricator.wikimedia.org/T292254 (10lmata) [21:33:00] 10SRE, 10SRE-OnFire (FY2021/2022-Q3), 10Infrastructure-Foundations, 10SRE Observability (FY2021/2022-Q3): Implement an accurate and easy to understand status page for all wikis - https://phabricator.wikimedia.org/T202061 (10lmata) [21:33:02] 10SRE-OnFire (FY2021/2022-Q3): incidents occurring during Q2 have been scored with the scorecard - https://phabricator.wikimedia.org/T292254 (10lmata) [21:33:42] 10SRE-OnFire (FY2021/2022-Q3): incidents occurring in Q3 have been scored with the scorecard - https://phabricator.wikimedia.org/T299977 (10lmata) [21:33:54] (03CR) 10Cwhite: ecs: post-review (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/756630 (owner: 10Jbond) [21:35:59] (03CR) 10Catrope: "Removing -1, since SBassett has +1ed" [puppet] - 10https://gerrit.wikimedia.org/r/754048 (https://phabricator.wikimedia.org/T285570) (owner: 10Catrope) [21:36:03] 10SRE-OnFire (FY2021/2022-Q3): incidents occurring during Q2 have been scored with the scorecard - https://phabricator.wikimedia.org/T292254 (10lmata) [21:38:48] 10SRE, 10Codex, 10WVUI, 10ContentSecurityPolicy, and 2 others: WVUI and Codex demos: CSP stopping typeahead input demos working - https://phabricator.wikimedia.org/T285570 (10Catrope) [21:38:58] 10SRE-OnFire (FY2021/2022-Q3): incidents occurring during Q2 have been scored with the scorecard - https://phabricator.wikimedia.org/T292254 (10lmata) [21:41:24] 10SRE, 10ops-codfw: Test Dell switches cabling - https://phabricator.wikimedia.org/T290133 (10Papaul) [21:42:16] (03CR) 10Ebernhardson: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/756643 (https://phabricator.wikimedia.org/T299797) (owner: 10Bking) [21:42:43] PROBLEM - snapshot of s4 in codfw on alert1001 is CRITICAL: snapshot for s4 at codfw taken more than 3 days ago: Most recent backup 2022-01-21 21:17:43 
https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [21:43:03] (03PS1) 10Cwhite: logstash: remove event.duration when value is hyphen [puppet] - 10https://gerrit.wikimedia.org/r/756683 [21:44:56] (03CR) 10jerkins-bot: [V: 04-1] logstash: remove event.duration when value is hyphen [puppet] - 10https://gerrit.wikimedia.org/r/756683 (owner: 10Cwhite) [21:47:17] PROBLEM - Disk space on centrallog1001 is CRITICAL: DISK CRITICAL - free space: /srv 32822 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=centrallog1001&var-datasource=eqiad+prometheus/ops [21:47:19] (03PS2) 10Cwhite: logstash: remove event.duration when value is hyphen [puppet] - 10https://gerrit.wikimedia.org/r/756683 [21:49:09] (03CR) 10Ryan Kemper: "This is ready to be re-enabled, but we're going to add new elastic* hosts to the eqiad fleet on Tuesday (tomorrow) so we'll re-enable sani" [puppet] - 10https://gerrit.wikimedia.org/r/752724 (https://phabricator.wikimedia.org/T295705) (owner: 10Ebernhardson) [21:50:25] (03CR) 10Cwhite: ecs: post-review (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/756630 (owner: 10Jbond) [21:53:42] (03CR) 10Ryan Kemper: [C: 03+2] wcqs: set QUERY_SERVICE env name with wcqs/wdqs [puppet] - 10https://gerrit.wikimedia.org/r/753973 (owner: 10DCausse) [21:55:01] (03CR) 10Ebernhardson: "I dunno if it would help, but there is also an `/etc/query_service` symlink that goes to either `/etc/wdqs` or `/etc/wcqs`. 
TBH i wouldn'" [puppet] - 10https://gerrit.wikimedia.org/r/753973 (owner: 10DCausse) [21:57:04] !log root@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudmetrics1003.eqiad.wmnet with OS bullseye [21:57:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:57:12] 10SRE, 10ops-eqiad, 10decommission-hardware, 10Patch-For-Review, 10cloud-services-team (Hardware): cloudmetrics1003 seizes up under load - https://phabricator.wikimedia.org/T297814 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by root@cumin1001 for host cloudmetrics1003.eqiad.wmnet... [21:57:40] !log root@cumin1001 START - Cookbook sre.hosts.reimage for host cloudmetrics1003.eqiad.wmnet with OS buster [21:57:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:57:48] 10SRE, 10ops-eqiad, 10decommission-hardware, 10Patch-For-Review, 10cloud-services-team (Hardware): cloudmetrics1003 seizes up under load - https://phabricator.wikimedia.org/T297814 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by root@cumin1001 for host cloudmetrics1003.eqiad.wm... [22:00:04] Reedy and sbassett: Your horoscope predicts another unfortunate Weekly Security deployment window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220124T2200). 
[22:02:44] (03PS1) 10Ahmon Dancy: WIP merge from master [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/756684 [22:03:02] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): cloudmetrics1004 potential hardware problem - https://phabricator.wikimedia.org/T299744 (10wiki_willy) [22:03:08] 10SRE, 10ops-eqiad, 10decommission-hardware, 10Patch-For-Review, 10cloud-services-team (Hardware): cloudmetrics1003 seizes up under load - https://phabricator.wikimedia.org/T297814 (10wiki_willy) [22:08:42] (03PS1) 10Hashar: gerrit: move CI result table to a tab [puppet] - 10https://gerrit.wikimedia.org/r/756685 [22:09:14] 10SRE-swift-storage, 10Commons, 10Internet-Archive: Error: 503, Backend fetch failed, while the file uploaded fine - https://phabricator.wikimedia.org/T299220 (10Yann) I can't upload https://archive.org/download/TheCollectedWorksOfMahatmaGandhiVolXXXIV/TheCollectedWorksOfMahatmaGandhiVolXXXIV.djvu this file... [22:09:33] (03CR) 10Hashar: [C: 04-1] "With Gerrit 3.4 that seems to work, untested on 3.3." [puppet] - 10https://gerrit.wikimedia.org/r/756685 (owner: 10Hashar) [22:10:23] (03PS1) 10JHathaway: users: remove unnecessary newline [puppet] - 10https://gerrit.wikimedia.org/r/756707 [22:11:51] (03CR) 10JHathaway: [C: 03+2] users: remove unnecessary newline [puppet] - 10https://gerrit.wikimedia.org/r/756707 (owner: 10JHathaway) [22:16:10] (03CR) 10Ryan Kemper: [C: 03+1] "Looks great. Very elegant to use the presence/absence of oauth_settings to decide on whether to pass the port through. 
Thanks for running " [puppet] - 10https://gerrit.wikimedia.org/r/754523 (owner: 10DCausse) [22:16:14] (03CR) 10Ryan Kemper: [C: 03+2] blazegraph: prometheus exporter may bypass nginx [puppet] - 10https://gerrit.wikimedia.org/r/754523 (owner: 10DCausse) [22:17:39] (03PS1) 10JHathaway: Grant skvjold access to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/756708 (https://phabricator.wikimedia.org/T299072) [22:19:39] (03CR) 10JHathaway: "Would love a review" [puppet] - 10https://gerrit.wikimedia.org/r/756708 (https://phabricator.wikimedia.org/T299072) (owner: 10JHathaway) [22:21:42] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to Superset for Margeigh Novotny - https://phabricator.wikimedia.org/T299072 (10jhathaway) I have made a gerrit patch with the changes that I believe need to be made, https://gerrit.wikimedia.org/r/c/operations/puppet/+/756708, @Volans if yo... [22:22:06] (03PS2) 10Ryan Kemper: wcqs: add discovery record [dns] - 10https://gerrit.wikimedia.org/r/755806 (https://phabricator.wikimedia.org/T282117) [22:24:27] 10SRE, 10SRE-Access-Requests: Requesting LDAP-only access to analytics-privatedata-users for Madalina Ana - https://phabricator.wikimedia.org/T299587 (10nshahquinn-wmf) [22:25:27] 10SRE, 10SRE-Access-Requests: Requesting LDAP-only access to analytics-privatedata-users for Madalina Ana - https://phabricator.wikimedia.org/T299587 (10DannyH) I approve. 
[22:27:50] (03CR) 10Bking: [C: 03+2] wcqs: add discovery record [dns] - 10https://gerrit.wikimedia.org/r/755806 (https://phabricator.wikimedia.org/T282117) (owner: 10Ryan Kemper) [22:32:24] !log T280001 T282117 Merged https://gerrit.wikimedia.org/r/c/operations/dns/+/755806 and ran `sudo -i authdns update` on `authdns1001.wikimedia.org` [22:32:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:32:31] T282117: WCQS needs to be exposed through a wikimedia.org domain - https://phabricator.wikimedia.org/T282117 [22:32:31] T280001: Set up puppet configuration for new WCQS cluster - https://phabricator.wikimedia.org/T280001 [22:32:38] (03PS1) 10Gerrit Patch Uploader: bgwiki: fix setup for Draft namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/756712 (https://phabricator.wikimedia.org/T299224) [22:32:40] (03CR) 10Gerrit Patch Uploader: "This commit was uploaded using the Gerrit Patch Uploader [1]." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/756712 (https://phabricator.wikimedia.org/T299224) (owner: 10Gerrit Patch Uploader) [22:33:47] (03PS2) 10RhinosF1: bgwiki: fix setup for Draft namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/756712 (https://phabricator.wikimedia.org/T299224) (owner: 10Gerrit Patch Uploader) [22:34:19] i got bored of gerrit [22:36:08] lol [22:36:41] tss tss I am watching ;D [22:37:03] I love gerrit hashar [22:37:27] possibly [22:37:47] the fun thing is that one of my kid earlier today told me "I got bored of homework" [22:38:17] it looked like some echo or RhinosF1 watching my home [22:38:26] hah [22:39:09] (03PS1) 10Bking: wcqs: move service into production status [puppet] - 10https://gerrit.wikimedia.org/r/756713 (https://phabricator.wikimedia.org/T280001) [22:39:31] anyway I should stop lurking here. Happy patches and have safe deployments! 
[22:39:41] Merci Monsieur [22:39:49] (03CR) 10jerkins-bot: [V: 04-1] wcqs: move service into production status [puppet] - 10https://gerrit.wikimedia.org/r/756713 (https://phabricator.wikimedia.org/T280001) (owner: 10Bking) [22:40:43] hashar: 11pm on a Monday evening is too late to make git work and gerrit web UI fails with IS.php [22:40:55] so i went to github and just used a patch file [22:42:42] RhinosF1: It also failed for me, until I disabled syntax highlighting [22:42:57] (03PS2) 10Bking: wcqs: move service into production status [puppet] - 10https://gerrit.wikimedia.org/r/756713 (https://phabricator.wikimedia.org/T280001) [22:42:58] for large pages it is problematic [22:43:05] hauskatze: yeah [22:43:09] files rather [22:43:23] was trying to do a quick change to fix an earlier patch [22:44:12] (03CR) 10Ryan Kemper: [C: 03+1] wcqs: move service into production status [puppet] - 10https://gerrit.wikimedia.org/r/756713 (https://phabricator.wikimedia.org/T280001) (owner: 10Bking) [22:44:26] (03CR) 10Bking: [C: 03+2] wcqs: move service into production status [puppet] - 10https://gerrit.wikimedia.org/r/756713 (https://phabricator.wikimedia.org/T280001) (owner: 10Bking) [22:48:19] !log T280001 Moved `wcqs` service state into `production` by merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/756713; running puppet on authdns/alert hosts [22:48:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:48:24] T280001: Set up puppet configuration for new WCQS cluster - https://phabricator.wikimedia.org/T280001 [22:48:41] !log root@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudmetrics1003.eqiad.wmnet with OS buster [22:48:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:48:49] 10SRE, 10ops-eqiad, 10decommission-hardware, 10Patch-For-Review, 10cloud-services-team (Hardware): cloudmetrics1003 seizes up under load - https://phabricator.wikimedia.org/T297814 (10ops-monitoring-bot) 
Cookbook cookbooks.sre.hosts.reimage started by root@cumin1001 for host cloudmetrics1003.eqiad.wmnet... [22:54:13] !log T280001 Removed downtime on `wcqs*` [22:54:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:54:18] T280001: Set up puppet configuration for new WCQS cluster - https://phabricator.wikimedia.org/T280001 [22:57:48] (03Abandoned) 10Bking: elasticsearch: fix package dependency issue [puppet] - 10https://gerrit.wikimedia.org/r/753988 (https://phabricator.wikimedia.org/T299177) (owner: 10Bking) [23:09:07] (03CR) 10Bking: [V: 03+1] cirrus: move to search.d.w cert [puppet] - 10https://gerrit.wikimedia.org/r/756595 (https://phabricator.wikimedia.org/T299633) (owner: 10Filippo Giunchedi) [23:09:27] (03CR) 10Ryan Kemper: [C: 03+1] "pcc / implementation both look great! Thanks for working this. I'll let you merge it at your convenience, but if you'd prefer us to merge " [puppet] - 10https://gerrit.wikimedia.org/r/756595 (https://phabricator.wikimedia.org/T299633) (owner: 10Filippo Giunchedi) [23:10:46] (03CR) 10Ryan Kemper: [C: 03+1] ssl: add search.d.w public key [puppet] - 10https://gerrit.wikimedia.org/r/756593 (https://phabricator.wikimedia.org/T299633) (owner: 10Filippo Giunchedi) [23:10:51] (03CR) 10Ryan Kemper: [C: 03+2] ssl: add search.d.w public key [puppet] - 10https://gerrit.wikimedia.org/r/756593 (https://phabricator.wikimedia.org/T299633) (owner: 10Filippo Giunchedi) [23:10:53] (03PS1) 10Ahmon Dancy: MWMultiVersion.php: Add option to read from JSON wikiversions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/756718 [23:11:52] (03CR) 10jerkins-bot: [V: 04-1] MWMultiVersion.php: Add option to read from JSON wikiversions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/756718 (owner: 10Ahmon Dancy) [23:14:44] (03PS2) 10Ahmon Dancy: MWMultiVersion.php: Add option to read from JSON wikiversions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/756718 [23:16:42] jouncebot nowandnext [23:16:42] For the 
next 0 hour(s) and 43 minute(s): Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220124T2200) [23:16:42] In 0 hour(s) and 43 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220125T0000) [23:21:35] RECOVERY - Check systemd state on snapshot1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:22:02] (03PS1) 10Ahmon Dancy: Revert "Choose wikiversions.php file relative to MWMultiVersion.php" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/756720 [23:23:47] I'm going to deploy a config fix [23:25:06] (03CR) 10Ahmon Dancy: [C: 03+2] Revert "Choose wikiversions.php file relative to MWMultiVersion.php" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/756720 (owner: 10Ahmon Dancy) [23:25:51] (03Merged) 10jenkins-bot: Revert "Choose wikiversions.php file relative to MWMultiVersion.php" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/756720 (owner: 10Ahmon Dancy) [23:29:12] !log dancy@deploy1002 Synchronized multiversion/MWMultiVersion.php: Config: [[gerrit:756720|Revert "Choose wikiversions.php file relative to MWMultiVersion.php"]] (duration: 00m 49s) [23:29:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:31:11] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [23:31:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:32:23] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [23:32:24] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [23:32:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:32:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:33:05] (03PS1) 10Andrew Bogott: Make cloudmetrics 
the primary again [puppet] - 10https://gerrit.wikimedia.org/r/756722 (https://phabricator.wikimedia.org/T299744) [23:33:36] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [23:33:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:35:05] (03CR) 10Andrew Bogott: [C: 03+2] Make cloudmetrics the primary again [puppet] - 10https://gerrit.wikimedia.org/r/756722 (https://phabricator.wikimedia.org/T299744) (owner: 10Andrew Bogott) [23:39:44] (03PS1) 10Ebernhardson: rdf query service: Include host header with proxy_pass [puppet] - 10https://gerrit.wikimedia.org/r/756724 [23:41:35] (03CR) 10jerkins-bot: [V: 04-1] rdf query service: Include host header with proxy_pass [puppet] - 10https://gerrit.wikimedia.org/r/756724 (owner: 10Ebernhardson) [23:43:50] RoanKattouw: if you're deploying, mind doing mine first as it'll be after midnight? [23:46:09] Sure! [23:48:42] (03PS2) 10Ebernhardson: rdf query service: Include host header with proxy_pass [puppet] - 10https://gerrit.wikimedia.org/r/756724 (https://phabricator.wikimedia.org/T295676) [23:49:29] Juan_90264: hi [23:50:04] Hello RhinosF1 [23:50:20] (03CR) 10Ebernhardson: "tested on the codfw hosts, redirects with the hosts headers appear correct." [puppet] - 10https://gerrit.wikimedia.org/r/756724 (https://phabricator.wikimedia.org/T295676) (owner: 10Ebernhardson) [23:51:08] Juan_90264: in 10 minutes, we deploy fix for bgwiki. Did you see my comments? [23:52:00] RhinosF1: Yes I saw it, and thanks for the correction [23:52:28] Juan_90264: no problem