[00:00:54] It is Template:Wikidata Infobox/core with lots of calls to Module:WikidataIB ... but haven't found any edits to any of those. Info is probably there in changeprop / jobqueue jobs somewhere .. need a way to log changes that trigger a large volume of reparse events if it isn't already there somewhere.
[00:02:01] subbu: every job logs a reqId in logstash, which should in theory identify the original edit that started the chain
[00:02:10] since we preserve and pass that on
[00:02:35] I see .. good to know. I need to learn about this sleuthing.
[00:02:36] (PS1) Cwhite: logstash: add loki output support [puppet] - https://gerrit.wikimedia.org/r/809722 (https://phabricator.wikimedia.org/T222826)
[00:02:48] https://logstash.wikimedia.org/goto/c3de970e794549bd76a74fc766cc7254
[00:03:02] I clicked on a random one of those reqId fields
[00:03:14] 250,000 entries matching that reqId
[00:04:16] I'm confused as to how we have a job that triggers parsoid api requests though
[00:04:27] what kind of a job does that?
[00:04:36] Template:Wikidata Infobox/i18n/en] ?
[00:04:45] TranslateRenderJob
[00:04:52] Indeed, that's the first log message
[00:06:51] as for parsoid reparses ... the chain is: changeprop -> restbase -> parsoid.
[00:06:53] okay, the full story is at https://logstash.wikimedia.org/goto/7e6fae652e7aa6bfa71099f1db98ee6f
[00:07:15] the slow-log dashboard isn't a good one to try and repurpose into seeing all messages, as some of its panels have unmodifiable filters
[00:07:27] moved the reqId search to the general mediawiki dashboard instead
[00:08:08] it started with /w/index.php?title=Translations:Template:Wikidata_Infobox/i18n/msg-search-depicted/en&action=submit
[00:10:15] Jun 28, 2022 @ 18:58
[00:10:23] oye! one translation updated and the whole firestorm started.
[00:10:57] this edit then queued a job which ran to completion on a jobrunner (18:58:01) `TranslateRenderJob [Template:Wikidata Infobox/i18n/en]: Finished TranslateRenderJob`
[00:11:38] Then six seconds later we see a storm of api.php requests for reasons unclear to me, many of which log errors like `Pool key 'commonswiki:pcache:idhash: .. ' (ArticleView): Usage error: You may only aquire a single non-nowait lock.`
[00:13:31] an hour later we see one entry from a parsoid server: `/w/rest.php/commons.wikimedia.org/v3/page/pagebundle/Koch_snowflake/619632125`
[00:13:40] again, same reqId chain still.
[00:13:49] PROBLEM - Check systemd state on thanos-be2001 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service,man-db.service,prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:14:57] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2157.codfw.wmnet with reason: host reimage
[00:15:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:15:52] my brain is fried here ... i'll have to step away for a bit and will look back here to see what you all find and if there is any action needed on our end.
[00:18:47] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2157.codfw.wmnet with reason: host reimage
[00:18:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:19:02] I'm guessing changeprop isn't just pregenerating a parsoid result in restbase for every edit, but also propagating template edits in its own custom way based on template links information from somewhere, and presumably not in a way that honours the ten years of optimisations we applied to refreshlinks in MW core, nor anything else jobqueue related.
[00:20:19] One issue at least that seems in need of investigating further is that these API requests (which I'm guessing start independently from changeprop, I wasn't aware that changeprop knew the edit reqId and re-used it the same way as our jobrunner, that's useful actually, I'm curious where it gets that reqId from though given mw core doesn't trigger that afaik). - that these API requests are managing to very often trigger a poolcounter
[00:20:19] error, that shouldn't happen under normal circumstances.
[00:22:07] those API requests are also, again for unknown reasons, initiating MW sessions, and logging things like `Failed to fetch commonswiki:MWSession:…… : (503) `
[00:22:28] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:23:14] RECOVERY - Check systemd state on es2033 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:30:54] PROBLEM - Number of backend failures per minute from CirrusSearch on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [600.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&viewPanel=9
[00:32:02] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2157.codfw.wmnet with OS bullseye
[00:32:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:32:10] SRE, ops-codfw, DBA, DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db2157.codfw.wmnet with OS bullseye completed: - db2...
[00:32:45] (03PS9) 10DDesouza: QuickSurveys: Deploy research-incentive to jawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806960 (https://phabricator.wikimedia.org/T311015) [00:33:13] checking into cirrus failures alert [00:37:53] (03PS10) 10DDesouza: QuickSurveys: Deploy research-incentive to jawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806960 (https://phabricator.wikimedia.org/T311015) [00:45:21] (03CR) 10DDesouza: "Reduced coverage to exercise caution because we will not be able to take it down during the weekend and we want to disable the survey as s" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806960 (https://phabricator.wikimedia.org/T311015) (owner: 10DDesouza) [00:49:53] !log T310924 Cleared eqiad chi->omega cross cluster settings and reapplied [00:49:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:49:59] T310924: Investigate CirrusSearch eqiad failures - https://phabricator.wikimedia.org/T310924 [00:57:18] RECOVERY - Number of backend failures per minute from CirrusSearch on graphite1004 is OK: OK: Less than 20.00% above the threshold [300.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&viewPanel=9 [00:58:05] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db2158.codfw.wmnet with OS bullseye [00:58:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:58:11] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db2158.codfw.wmnet with OS bullseye [00:58:55] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db2155.codfw.wmnet with OS bullseye [00:58:59] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:59:00] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db2155.codfw.wmnet with OS bullseye executed with er... [01:07:31] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db2159.codfw.wmnet with OS bullseye [01:07:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:07:36] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db2159.codfw.wmnet with OS bullseye [01:17:25] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2158.codfw.wmnet with reason: host reimage [01:17:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:20:05] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (10Papaul) [01:20:49] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2158.codfw.wmnet with reason: host reimage [01:20:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:22:34] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:27:06] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2159.codfw.wmnet with reason: host reimage [01:27:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:32:40] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 
2:00:00 on db2159.codfw.wmnet with reason: host reimage [01:32:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:36:43] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2158.codfw.wmnet with OS bullseye [01:36:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:36:48] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db2158.codfw.wmnet with OS bullseye completed: - db2... [01:48:52] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2159.codfw.wmnet with OS bullseye [01:48:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:48:57] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db2159.codfw.wmnet with OS bullseye completed: - db2... 
[01:59:57] PROBLEM - Check systemd state on gitlab1003 is CRITICAL: CRITICAL - degraded: The following units failed: backup-restore.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:05:57] PROBLEM - Checks that the local airflow scheduler for airflow @research is working properly on an-airflow1002 is CRITICAL: CRITICAL: /usr/bin/env AIRFLOW_HOME=/srv/airflow-research /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1002.eqiad.wmnet did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow [02:11:00] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db2160.codfw.wmnet with OS bullseye [02:11:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:11:07] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db2160.codfw.wmnet with OS bullseye [02:11:47] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (10Papaul) [02:14:36] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [02:15:16] !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b3fe77c]: (no justification provided) [02:15:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:16:59] PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:17:19] !log bmansurov@deploy1002 deploy aborted: (no justification provided) (duration: 02m 03s) [02:17:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:18:34] !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b3fe77c]: (no justification provided) [02:18:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:18:43] !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b3fe77c]: (no justification provided) (duration: 00m 08s) [02:18:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:41:55] (03PS2) 10KartikMistry: Enable Wikistories on idwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/809165 (https://phabricator.wikimedia.org/T311143) (owner: 10Sbisson) [02:47:53] !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b3fe77c]: (no justification provided) [02:47:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:47:57] !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b3fe77c]: (no justification provided) (duration: 00m 03s) [02:47:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:48:40] !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b3fe77c]: (no justification provided) [02:48:42] !log bmansurov@deploy1002 deploy aborted: (no justification provided) (duration: 00m 02s) [02:48:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:48:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:48:53] !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b3fe77c]: (no justification provided) [02:48:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:49:01] !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b3fe77c]: (no justification provided) 
(duration: 00m 08s) [02:49:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:49:56] !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b3fe77c]: (no justification provided) [02:50:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:50:04] !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b3fe77c]: (no justification provided) (duration: 00m 08s) [02:50:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:59:17] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db2160.codfw.wmnet with OS bullseye [02:59:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:59:23] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db2160.codfw.wmnet with OS bullseye executed with er... 
[03:18:23] RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:32:03] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:35:59] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [03:41:41] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_dump_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:32:27] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:42:07] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_dump_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:48:57] PROBLEM - k8s API server requests latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 verb={CREATE,UPDATE} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [05:05:29] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [05:17:02] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 10 hosts with reason: Primary switchover x1 T300472 [05:17:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:17:09] T300472: Switchover x1 master (db1103 -> db1120) - 
https://phabricator.wikimedia.org/T300472 [05:17:21] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 10 hosts with reason: Primary switchover x1 T300472 [05:17:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:17:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set db1120 with weight 0 T300472', diff saved to https://phabricator.wikimedia.org/P30632 and previous config saved to /var/cache/conftool/dbconfig/20220630-051730-root.json [05:17:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:22:11] (03PS2) 10Marostegui: mariadb: Promote db1120 to x1 master [puppet] - 10https://gerrit.wikimedia.org/r/809607 (https://phabricator.wikimedia.org/T300472) [05:25:13] (03CR) 10Marostegui: [C: 03+2] mariadb: Promote db1120 to x1 master [puppet] - 10https://gerrit.wikimedia.org/r/809607 (https://phabricator.wikimedia.org/T300472) (owner: 10Marostegui) [05:26:44] 10SRE, 10ops-eqiad, 10DBA: db1173 won't boot up - https://phabricator.wikimedia.org/T310595 (10Marostegui) Thank you! [05:32:45] (03PS1) 10Marostegui: Revert "db1173: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/809700 [06:00:05] kormat, marostegui, and Amir1: It is that lovely time of the day again! You are hereby commanded to deploy Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220630T0600). [06:00:15] o/ [06:01:59] Anyone else around? 
[06:03:13] marostegui: I am [06:03:18] Amir1: o/ [06:03:20] Let's start then [06:03:24] !log Starting x1 eqiad failover from db1103 to db1120 - T300472 [06:03:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:03:30] T300472: Switchover x1 master (db1103 -> db1120) - https://phabricator.wikimedia.org/T300472 [06:03:50] Reminder: there is no way to put MW on RO for x1 [06:03:57] So I will do it directly to the master on mysql [06:05:10] all done [06:05:26] Amir1: Can you try to generate a write on x1? [06:05:48] Sure [06:06:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote db1120 to x1 primary and set section read-write T300472', diff saved to https://phabricator.wikimedia.org/P30633 and previous config saved to /var/cache/conftool/dbconfig/20220630-060601-root.json [06:06:06] ^ now [06:06:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:06:31] I see connections on the new master [06:06:48] marostegui: it works [06:06:50] https://w.wiki/5NUn [06:07:34] great! [06:11:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1103 T300472', diff saved to https://phabricator.wikimedia.org/P30634 and previous config saved to /var/cache/conftool/dbconfig/20220630-061140-root.json [06:11:55] Everything looks fine [06:12:02] Wohoo [06:12:11] Thanks! 
[06:14:36] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [06:16:26] (03PS1) 10Marostegui: db1103: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/809878 (https://phabricator.wikimedia.org/T300099) [06:21:19] (03CR) 10Marostegui: [C: 03+2] db1103: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/809878 (https://phabricator.wikimedia.org/T300099) (owner: 10Marostegui) [06:23:44] (03PS1) 10Giuseppe Lavagetto: mediawiki::php: fix proxy selection when using unix sockets [puppet] - 10https://gerrit.wikimedia.org/r/809879 [06:25:15] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db1103.eqiad.wmnet with OS bullseye [06:33:37] (03CR) 10Marostegui: [C: 03+2] Revert "db1173: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/809700 (owner: 10Marostegui) [06:33:47] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1103.eqiad.wmnet with reason: host reimage [06:33:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:36:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1173 (re)pooling @ 1%: After on-site maintenance', diff saved to https://phabricator.wikimedia.org/P30635 and previous config saved to /var/cache/conftool/dbconfig/20220630-063622-root.json [06:36:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:37:21] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1103.eqiad.wmnet with reason: host reimage [06:37:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:38:21] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (NOOP 1 
DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36134/console" [puppet] - 10https://gerrit.wikimedia.org/r/809879 (owner: 10Giuseppe Lavagetto) [06:39:03] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_dump_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:46:01] RECOVERY - Check systemd state on mwdebug2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:51:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1173 (re)pooling @ 2%: After on-site maintenance', diff saved to https://phabricator.wikimedia.org/P30636 and previous config saved to /var/cache/conftool/dbconfig/20220630-065126-root.json [06:51:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:53:41] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:54:09] PROBLEM - Check systemd state on mwdebug2002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:54:50] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1103.eqiad.wmnet with OS bullseye [06:56:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1173 (re)pooling @ 1%: After on-site maintenance', diff saved to https://phabricator.wikimedia.org/P30637 and previous config saved to /var/cache/conftool/dbconfig/20220630-065621-root.json [06:56:23] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/809594 (owner: 10Slyngshede) [06:56:25] (03PS1) 10Marostegui: Revert "db1103: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/809887 [06:56:27] (03CR) 10Slyngshede: [V: 03+1 C: 03+2] profile::prometheus::ops enable Ganeti metric scraping. [puppet] - 10https://gerrit.wikimedia.org/r/809594 (owner: 10Slyngshede) [06:57:20] (03CR) 10Marostegui: [C: 03+2] Revert "db1103: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/809887 (owner: 10Marostegui) [06:58:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1103 (re)pooling @ 1%: After reimage', diff saved to https://phabricator.wikimedia.org/P30638 and previous config saved to /var/cache/conftool/dbconfig/20220630-065857-root.json [06:59:46] (03PS2) 10Giuseppe Lavagetto: mediawiki::php: fix proxy selection when using unix sockets [puppet] - 10https://gerrit.wikimedia.org/r/809879 [07:00:05] Amir1 and apergos: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC morning backport and config training . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220630T0700). [07:00:17] hello everybody! [07:00:24] there are no trainees signed up for today's window [07:00:38] and that's a good thing because there are also no patches scheduled for deployment :-D [07:01:00] if anyone wants to step up and self deploy, now's the time, in about 15 minutes I'm going to wander off. 
[07:11:22] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (NOOP 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36135/console" [puppet] - 10https://gerrit.wikimedia.org/r/809879 (owner: 10Giuseppe Lavagetto) [07:11:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1173 (re)pooling @ 2%: After on-site maintenance', diff saved to https://phabricator.wikimedia.org/P30639 and previous config saved to /var/cache/conftool/dbconfig/20220630-071125-root.json [07:11:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:14:05] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (NOOP 2 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36136/console" [puppet] - 10https://gerrit.wikimedia.org/r/809879 (owner: 10Giuseppe Lavagetto) [07:15:06] (03CR) 10Giuseppe Lavagetto: [V: 03+1 C: 03+2] mediawiki::php: fix proxy selection when using unix sockets [puppet] - 10https://gerrit.wikimedia.org/r/809879 (owner: 10Giuseppe Lavagetto) [07:15:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1103 weight', diff saved to https://phabricator.wikimedia.org/P30640 and previous config saved to /var/cache/conftool/dbconfig/20220630-071522-marostegui.json [07:15:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:15:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1103 (re)pooling @ 1%: After reimage', diff saved to https://phabricator.wikimedia.org/P30641 and previous config saved to /var/cache/conftool/dbconfig/20220630-071526-root.json [07:15:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:17:11] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/807983 (owner: 10Slyngshede) [07:17:11] welp 15 minutes later no one has stepped up so that's it for today [07:18:55] 10SRE, 10DNS, 10Infrastructure-Foundations, 10Mail, and 2 others: Consider if to support BIMI for wiki mail - https://phabricator.wikimedia.org/T311685 (10jcrespo) [07:19:28] (03CR) 10Filippo Giunchedi: [C: 04-1] "See also extended rationale in the task, I don't think this is necessary" [puppet] - 10https://gerrit.wikimedia.org/r/808040 (https://phabricator.wikimedia.org/T311262) (owner: 10Herron) [07:21:44] 10SRE, 10DNS, 10Infrastructure-Foundations, 10Mail, and 2 others: Consider if to support BIMI for wiki mail - https://phabricator.wikimedia.org/T311685 (10jcrespo) I created this when I saw someone mentioning it on discord. Ping @Vgutierrez @BBlack (I personally have no thought, I didn't know this was a th... [07:21:52] (03CR) 10Filippo Giunchedi: "LGTM overall, see inline" [puppet] - 10https://gerrit.wikimedia.org/r/806349 (https://phabricator.wikimedia.org/T222826) (owner: 10Cwhite) [07:24:09] (03CR) 10Filippo Giunchedi: "Idea LGTM, see inline" [puppet] - 10https://gerrit.wikimedia.org/r/809709 (https://phabricator.wikimedia.org/T222826) (owner: 10Cwhite) [07:25:51] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 (10MoritzMuehlenhoff) [07:26:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1173 (re)pooling @ 5%: After on-site maintenance', diff saved to https://phabricator.wikimedia.org/P30642 and previous config saved to /var/cache/conftool/dbconfig/20220630-072629-root.json [07:26:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:26:47] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (10MoritzMuehlenhoff) [07:26:54] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/codfw to Bullseye - 
https://phabricator.wikimedia.org/T311686 (10MoritzMuehlenhoff) p:05Triage→03Medium [07:26:58] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (10MoritzMuehlenhoff) p:05Triage→03Medium [07:30:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1103 (re)pooling @ 2%: After reimage', diff saved to https://phabricator.wikimedia.org/P30643 and previous config saved to /var/cache/conftool/dbconfig/20220630-073030-root.json [07:30:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:32:48] (03CR) 10Slyngshede: [V: 03+1 C: 03+2] class role::apt_repo switch apt-repo to Apache2, from nginx. [puppet] - 10https://gerrit.wikimedia.org/r/807983 (owner: 10Slyngshede) [07:36:26] RECOVERY - Disk space on thanos-be2001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be2001&var-datasource=codfw+prometheus/ops [07:36:39] (03Abandoned) 10Filippo Giunchedi: am: add 'host' label and add port to 'instance' [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/763460 (https://phabricator.wikimedia.org/T300951) (owner: 10Filippo Giunchedi) [07:37:34] RECOVERY - Check systemd state on mwdebug2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:41:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1173 (re)pooling @ 10%: After on-site maintenance', diff saved to https://phabricator.wikimedia.org/P30644 and previous config saved to /var/cache/conftool/dbconfig/20220630-074133-root.json [07:41:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:42:32] !log Move apt repository to Apache2, from Nginx https://gerrit.wikimedia.org/r/c/operations/puppet/+/807983 [07:42:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:45:35] !log 
marostegui@cumin1001 dbctl commit (dc=all): 'db1103 (re)pooling @ 10%: After reimage', diff saved to https://phabricator.wikimedia.org/P30645 and previous config saved to /var/cache/conftool/dbconfig/20220630-074534-root.json [07:45:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:49:14] 10SRE, 10Cassandra: Allow Cassandra to be deployed on Bullseye nodes - https://phabricator.wikimedia.org/T310980 (10MoritzMuehlenhoff) >>! In T310980#8037616, @elukey wrote: > If everybody agrees I'd keep Buster for the moment, and possibly ML could be the first cluster to be upgraded when Cassadra 4 is import... [07:52:47] 10SRE-swift-storage: Shorten Thanos retention - https://phabricator.wikimedia.org/T311690 (10fgiunchedi) [07:53:14] (03PS1) 10Aklapper: Phabricator: Remove unneeded translation overrides [puppet] - 10https://gerrit.wikimedia.org/r/809907 (https://phabricator.wikimedia.org/T309746) [07:55:10] 10SRE, 10Infrastructure-Foundations, 10Epic: Tracking task for Bullseye migrations in production - https://phabricator.wikimedia.org/T291916 (10Marostegui) [07:56:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1173 (re)pooling @ 25%: After on-site maintenance', diff saved to https://phabricator.wikimedia.org/P30646 and previous config saved to /var/cache/conftool/dbconfig/20220630-075637-root.json [07:56:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:59:00] 10SRE, 10Cassandra: Allow Cassandra to be deployed on Bullseye nodes - https://phabricator.wikimedia.org/T310980 (10elukey) Definitely :) The main worry that I have now is that moving to Bullseye for Cassandra nodes will mean upgrading to 4.x at this point, unless we find a way to move cqlsh.py to python 3 in... [08:00:05] dduvall and hashar: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220630T0800). [08:00:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1103 (re)pooling @ 25%: After reimage', diff saved to https://phabricator.wikimedia.org/P30647 and previous config saved to /var/cache/conftool/dbconfig/20220630-080038-root.json [08:00:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:01:32] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_dump_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:08:17] 10SRE, 10Cassandra: Allow Cassandra to be deployed on Bullseye nodes - https://phabricator.wikimedia.org/T310980 (10elukey) I checked in the jira that was pointed out earlier, and I noticed two things: 1) Most of the subtasks are related to finding how to test things with python3 etc.. 2) All the discussions... [08:10:32] PROBLEM - HTTPS on apt1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/APT_repository [08:11:23] slyngs: ^^ [08:11:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1173 (re)pooling @ 50%: After on-site maintenance', diff saved to https://phabricator.wikimedia.org/P30648 and previous config saved to /var/cache/conftool/dbconfig/20220630-081140-root.json [08:11:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:12:25] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host ml-cache2001.codfw.wmnet with OS buster [08:12:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:15:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1103 (re)pooling @ 50%: After reimage', diff saved to https://phabricator.wikimedia.org/P30649 and previous config saved to /var/cache/conftool/dbconfig/20220630-081542-root.json [08:15:47] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:19:00] !log elukey@deploy1002 Started deploy [ores/deploy@dfaec93]: Update ores submodule to its latest commit and scap canary settings [08:19:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:19:51] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops: Upgrade to Bird 2 - https://phabricator.wikimedia.org/T310574 (10ayounsi) 05Open→03Resolved a:03ayounsi Awesome, thanks a lot @ssingh I slightly cleaned up the doc (added a mention of the bird2 upgrade) And updated the dashboard at https://g... [08:20:38] (03PS2) 10Giuseppe Lavagetto: wmflib::service::get_url: avoid using monitoring to find the url. [puppet] - 10https://gerrit.wikimedia.org/r/800010 [08:22:35] 10SRE, 10ops-codfw, 10DBA: es2033 crashed at Jun 28 ~15:34 - https://phabricator.wikimedia.org/T311526 (10Marostegui) 05Open→03Resolved Data looks fine, resolving. [08:24:42] (03PS1) 10Slyngshede: P:aptrepo::wikimedia enable OCSP stapling [puppet] - 10https://gerrit.wikimedia.org/r/809911 [08:24:56] PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:26:06] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-cache2001.codfw.wmnet with reason: host reimage [08:26:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:26:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1173 (re)pooling @ 75%: After on-site maintenance', diff saved to https://phabricator.wikimedia.org/P30650 and previous config saved to /var/cache/conftool/dbconfig/20220630-082644-root.json [08:26:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:28:27] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host ml-cache2002.codfw.wmnet with OS buster [08:28:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[08:28:31] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-cache2001.codfw.wmnet with reason: host reimage [08:28:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:28:53] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36137/console" [puppet] - 10https://gerrit.wikimedia.org/r/809911 (owner: 10Slyngshede) [08:29:53] (03CR) 10Vgutierrez: [C: 03+1] trafficserver: 9.x upgrade: switch ip_allow.config to YAML format (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/803272 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh) [08:30:01] 10SRE, 10DNS, 10Fundraising-Backlog, 10Infrastructure-Foundations, and 3 others: Consider if to support BIMI for wiki mail - https://phabricator.wikimedia.org/T311685 (10greg) The email team in fundraising has interest in this topic as well. [08:30:25] (03CR) 10Vgutierrez: [C: 03+1] trafficserver: 9.x upgrade: replace client.verify.server [puppet] - 10https://gerrit.wikimedia.org/r/803296 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh) [08:30:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1103 (re)pooling @ 75%: After reimage', diff saved to https://phabricator.wikimedia.org/P30651 and previous config saved to /var/cache/conftool/dbconfig/20220630-083046-root.json [08:30:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:32:54] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:33:48] !log elukey@deploy1002 Finished deploy [ores/deploy@dfaec93]: Update ores submodule to its latest commit and scap canary settings (duration: 14m 48s) [08:33:52] (03PS1) 10Muehlenhoff: Extend custom raid fact to support Perc 750 [puppet] - 10https://gerrit.wikimedia.org/r/809913 
(https://phabricator.wikimedia.org/T297913) [08:33:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:34:46] (03CR) 10CI reject: [V: 04-1] Extend custom raid fact to support Perc 750 [puppet] - 10https://gerrit.wikimedia.org/r/809913 (https://phabricator.wikimedia.org/T297913) (owner: 10Muehlenhoff) [08:35:32] 10SRE, 10DNS, 10Fundraising-Backlog, 10Infrastructure-Foundations, and 3 others: Consider if to support BIMI for wiki mail - https://phabricator.wikimedia.org/T311685 (10jcrespo) Probably related: T211404 T167337 [08:38:10] (03CR) 10Giuseppe Lavagetto: [C: 03+2] wmflib::service::get_url: avoid using monitoring to find the url. [puppet] - 10https://gerrit.wikimedia.org/r/800010 (owner: 10Giuseppe Lavagetto) [08:39:26] (03CR) 10Elukey: [C: 03+1] ores: Assign SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/809625 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [08:40:10] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_dump_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:40:52] 10SRE-swift-storage, 10Lift-Wing, 10Machine-Learning-Team (Active Tasks): Create Swift account for readonly access to ML models - https://phabricator.wikimedia.org/T311628 (10elukey) @MatthewVernon hi! Do you have any guidance about how to proceed? 
[08:41:15] 10SRE-swift-storage, 10Lift-Wing, 10Machine-Learning-Team (Active Tasks): Create Swift account for readonly access to ML models - https://phabricator.wikimedia.org/T311628 (10elukey) [08:41:28] 10SRE-swift-storage, 10Lift-Wing, 10Machine-Learning-Team (Active Tasks): Create Swift account for readonly access to ML models - https://phabricator.wikimedia.org/T311628 (10elukey) [08:41:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1173 (re)pooling @ 100%: After on-site maintenance', diff saved to https://phabricator.wikimedia.org/P30652 and previous config saved to /var/cache/conftool/dbconfig/20220630-084148-root.json [08:41:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:42:01] (03PS2) 10Muehlenhoff: Extend custom raid fact to support Perc 750 [puppet] - 10https://gerrit.wikimedia.org/r/809913 (https://phabricator.wikimedia.org/T297913) [08:42:02] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-cache2002.codfw.wmnet with reason: host reimage [08:42:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:42:46] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host ml-cache2003.codfw.wmnet with OS buster [08:42:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:44:54] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-cache2002.codfw.wmnet with reason: host reimage [08:44:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:45:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1103 (re)pooling @ 100%: After reimage', diff saved to https://phabricator.wikimedia.org/P30653 and previous config saved to /var/cache/conftool/dbconfig/20220630-084550-root.json [08:45:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:46:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove weight from x1 master - not neeed 
anymore', diff saved to https://phabricator.wikimedia.org/P30654 and previous config saved to /var/cache/conftool/dbconfig/20220630-084621-marostegui.json [08:46:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:12] (03PS3) 10Muehlenhoff: Extend custom raid fact to support Perc 750 [puppet] - 10https://gerrit.wikimedia.org/r/809913 (https://phabricator.wikimedia.org/T297913) [08:56:24] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-cache2003.codfw.wmnet with reason: host reimage [08:56:26] !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on ml-cache2003.codfw.wmnet with reason: host reimage [08:56:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:56:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:10] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-cache2001.codfw.wmnet with OS buster [08:57:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:02:55] (03PS54) 10Raymond Ndibe: Create REST api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) [09:07:07] (03PS1) 10Muehlenhoff: Enable component/ganeti3 for codfw [puppet] - 10https://gerrit.wikimedia.org/r/809920 (https://phabricator.wikimedia.org/T311686) [09:10:32] (03PS1) 10Raymond Ndibe: Modify maintain-dbusers.py to call the rest-api service [puppet] - 10https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) [09:16:08] (03Abandoned) 10Slyngshede: P:aptrepo::wikimedia enable OCSP stapling [puppet] - 10https://gerrit.wikimedia.org/r/809911 (owner: 10Slyngshede) [09:17:43] (03CR) 10Muehlenhoff: [C: 03+2] Enable component/ganeti3 for codfw [puppet] - 10https://gerrit.wikimedia.org/r/809920 (https://phabricator.wikimedia.org/T311686) (owner: 10Muehlenhoff) 
[09:19:17] ACKNOWLEDGEMENT - HTTPS on apt1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response Slyngshede waiting for rollback https://wikitech.wikimedia.org/wiki/APT_repository [09:26:21] (03PS1) 10Slyngshede: P:aptrepo::wikimedia rollback Apache migration, due to OCSP stapling. [puppet] - 10https://gerrit.wikimedia.org/r/809923 [09:28:45] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36139/console" [puppet] - 10https://gerrit.wikimedia.org/r/801621 (https://phabricator.wikimedia.org/T91820) (owner: 10Tim Starling) [09:28:48] (03CR) 10CI reject: [V: 04-1] P:aptrepo::wikimedia rollback Apache migration, due to OCSP stapling. [puppet] - 10https://gerrit.wikimedia.org/r/809923 (owner: 10Slyngshede) [09:32:21] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:34:40] (03PS1) 10Giuseppe Lavagetto: Blackify python files [software/benchmw] - 10https://gerrit.wikimedia.org/r/809924 [09:34:42] (03PS1) 10Giuseppe Lavagetto: Add --cookie command line option [software/benchmw] - 10https://gerrit.wikimedia.org/r/809925 [09:34:51] (03PS1) 10Marostegui: instances.yaml: Remove db2083 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/809926 (https://phabricator.wikimedia.org/T311695) [09:35:07] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36140/console" [puppet] - 10https://gerrit.wikimedia.org/r/809923 (owner: 10Slyngshede) [09:35:08] !log volans@cumin1001 START - Cookbook sre.hosts.downtime for 0:20:00 on sretest1001.eqiad.wmnet with reason: Testing [09:35:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:21] !log volans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 
0:20:00 on sretest1001.eqiad.wmnet with reason: Testing [09:35:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:04] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-cache2002.codfw.wmnet with OS buster [09:36:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:57] (03CR) 10Marostegui: [C: 03+2] instances.yaml: Remove db2083 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/809926 (https://phabricator.wikimedia.org/T311695) (owner: 10Marostegui) [09:38:42] (03PS2) 10Slyngshede: P:aptrepo::wikimedia rollback Apache migration, due to OCSP stapling. [puppet] - 10https://gerrit.wikimedia.org/r/809923 [09:41:11] (03CR) 10CI reject: [V: 04-1] P:aptrepo::wikimedia rollback Apache migration, due to OCSP stapling. [puppet] - 10https://gerrit.wikimedia.org/r/809923 (owner: 10Slyngshede) [09:42:01] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_dump_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:42:17] (03PS3) 10Slyngshede: P:aptrepo::wikimedia rollback Apache migration, due to OCSP stapling. [puppet] - 10https://gerrit.wikimedia.org/r/809923 [09:42:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove db2083 from dbctl', diff saved to https://phabricator.wikimedia.org/P30655 and previous config saved to /var/cache/conftool/dbconfig/20220630-094239-marostegui.json [09:42:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:49] (03CR) 10Vgutierrez: [V: 03+1] Implement MediaWiki multi-DC traffic component (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/801621 (https://phabricator.wikimedia.org/T91820) (owner: 10Tim Starling) [09:44:49] (03CR) 10CI reject: [V: 04-1] P:aptrepo::wikimedia rollback Apache migration, due to OCSP stapling. 
[puppet] - 10https://gerrit.wikimedia.org/r/809923 (owner: 10Slyngshede) [09:47:50] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-cache2003.codfw.wmnet with OS buster [09:47:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:17] (03PS4) 10Slyngshede: P:aptrepo::wikimedia rollback Apache migration, due to OCSP stapling. [puppet] - 10https://gerrit.wikimedia.org/r/809923 [09:55:47] (03CR) 10CI reject: [V: 04-1] P:aptrepo::wikimedia rollback Apache migration, due to OCSP stapling. [puppet] - 10https://gerrit.wikimedia.org/r/809923 (owner: 10Slyngshede) [09:56:06] (03CR) 10Alexandros Kosiaris: [C: 03+1] Blackify python files [software/benchmw] - 10https://gerrit.wikimedia.org/r/809924 (owner: 10Giuseppe Lavagetto) [09:59:15] (03CR) 10Alexandros Kosiaris: [C: 03+1] Add --cookie command line option [software/benchmw] - 10https://gerrit.wikimedia.org/r/809925 (owner: 10Giuseppe Lavagetto) [10:00:04] mvolz: May I have your attention please! Services – Citoid / Zotero. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220630T1000) [10:03:05] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] Blackify python files [software/benchmw] - 10https://gerrit.wikimedia.org/r/809924 (owner: 10Giuseppe Lavagetto) [10:06:46] (03PS1) 10Stang: Fixes Content sub unreadable in Vector 22 [skins/Vector] (wmf/1.39.0-wmf.18) - 10https://gerrit.wikimedia.org/r/809890 (https://phabricator.wikimedia.org/T311564) [10:08:50] (03CR) 10Volans: [C: 03+1] "LGTM, thanks for the addition" [cookbooks] - 10https://gerrit.wikimedia.org/r/809599 (https://phabricator.wikimedia.org/T311593) (owner: 10Muehlenhoff) [10:10:27] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] Add --cookie command line option [software/benchmw] - 10https://gerrit.wikimedia.org/r/809925 (owner: 10Giuseppe Lavagetto) [10:12:45] (03CR) 10Klausman: [C: 03+1] api-gateway: allow discovery services to set custom rate limits (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/809198 (https://phabricator.wikimedia.org/T295956) (owner: 10Hnowlan) [10:14:36] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [10:17:16] (03CR) 10Ladsgroup: [C: 03+1] multiversion: Move missing.php from wmf-config/ to /multiversion [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807610 (https://phabricator.wikimedia.org/T308932) (owner: 10Krinkle) [10:17:43] (03CR) 10Ladsgroup: [C: 03+1] missing.php: Update docs and add test plan [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807609 (https://phabricator.wikimedia.org/T308932) (owner: 10Krinkle) [10:24:23] (03PS5) 10Slyngshede: P:aptrepo::wikimedia rollback Apache migration, due to OCSP stapling. 
[puppet] - 10https://gerrit.wikimedia.org/r/809923 [10:28:19] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 (10MoritzMuehlenhoff) [10:31:10] PROBLEM - High average GET latency for mw requests on appserver in codfw on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET [10:33:34] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:35:04] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/809923 (owner: 10Slyngshede) [10:37:44] (03CR) 10Slyngshede: [C: 03+2] P:aptrepo::wikimedia rollback Apache migration, due to OCSP stapling. [puppet] - 10https://gerrit.wikimedia.org/r/809923 (owner: 10Slyngshede) [10:38:59] (03CR) 10Volans: [C: 03+1] "I don't see it used either, and I don't recall the context." 
[software/netbox-extras] - 10https://gerrit.wikimedia.org/r/806908 (https://phabricator.wikimedia.org/T310745) (owner: 10Ayounsi) [10:40:48] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_dump_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:40:52] (03PS11) 10Volans: Add python3.10 support to Tox [cookbooks] - 10https://gerrit.wikimedia.org/r/803263 (owner: 10Ayounsi) [10:42:32] (03PS1) 10MarcoAurelio: Amend license request contact form per Legal [mediawiki-config] - 10https://gerrit.wikimedia.org/r/809932 (https://phabricator.wikimedia.org/T303359) [10:44:56] PROBLEM - DPKG on apt2001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [10:49:20] RECOVERY - HTTPS on apt1001 is OK: SSL OK - OCSP staple validity for apt.wikimedia.org has 281440 seconds left:Certificate apt.wikimedia.org valid until 2022-08-08 04:49:26 +0000 (expires in 38 days) https://wikitech.wikimedia.org/wiki/APT_repository [10:51:42] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:55:16] RECOVERY - High average GET latency for mw requests on appserver in codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET [10:56:52] PROBLEM - DPKG on apt1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [10:57:45] (03CR) 10Vgutierrez: Revert "Cache Badtitle 400s for 60s in varnish-fe" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/769827 (owner: 10Legoktm) [10:59:06] ACKNOWLEDGEMENT - DPKG on apt1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages Slyngshede Issue installing an reinstalling nginx https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [10:59:06] ACKNOWLEDGEMENT - DPKG on apt2001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages Slyngshede Issue installing an reinstalling nginx https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [11:09:40] PROBLEM - High average GET latency for mw requests on appserver in codfw on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET [11:16:22] RECOVERY - DPKG on apt2001 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [11:23:38] (03PS1) 10Slyngshede: P:puppet:agent run puppet agent one minute after boot. 
[puppet] - 10https://gerrit.wikimedia.org/r/809943 [11:28:18] RECOVERY - DPKG on apt1001 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [11:33:40] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:34:54] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [11:35:14] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36141/console" [puppet] - 10https://gerrit.wikimedia.org/r/809943 (owner: 10Slyngshede) [11:38:24] (03CR) 10Slyngshede: P:puppet:agent run puppet agent one minute after boot. [puppet] - 10https://gerrit.wikimedia.org/r/809943 (owner: 10Slyngshede) [11:43:18] PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:43:24] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_dump_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:44:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1128 (re)pooling @ 10%: Maint done', diff saved to https://phabricator.wikimedia.org/P30657 and previous config saved to /var/cache/conftool/dbconfig/20220630-114419-ladsgroup.json [11:44:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:43] 10SRE-swift-storage: Shorten Thanos retention - https://phabricator.wikimedia.org/T311690 (10fgiunchedi) A sample of said utilization (on thanos-be2002, other hosts are similar) {F35289121} [11:47:55] (03CR) 10Muehlenhoff: [C: 03+2] Disable swap before running wipefs [cookbooks] - 
10https://gerrit.wikimedia.org/r/809599 (https://phabricator.wikimedia.org/T311593) (owner: 10Muehlenhoff) [11:48:16] PROBLEM - High average GET latency for mw requests on appserver in codfw on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET [11:51:30] (03PS6) 10Filippo Giunchedi: icinga: remove 'monitoring' from service::catalog [puppet] - 10https://gerrit.wikimedia.org/r/793817 (https://phabricator.wikimedia.org/T291946) [11:54:41] (03Abandoned) 10Kosta Harlan: Structured task: Add 'cancel' to the list of allowed commands [extensions/GrowthExperiments] (wmf/1.39.0-wmf.17) - 10https://gerrit.wikimedia.org/r/809549 (https://phabricator.wikimedia.org/T311467) (owner: 10Kosta Harlan) [11:57:53] (03PS1) 10Filippo Giunchedi: swift: heavier banhammer for tegola object-server 'access logs' [puppet] - 10https://gerrit.wikimedia.org/r/809966 (https://phabricator.wikimedia.org/T297959) [11:59:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1128 (re)pooling @ 25%: Maint done', diff saved to https://phabricator.wikimedia.org/P30658 and previous config saved to /var/cache/conftool/dbconfig/20220630-115923-ladsgroup.json [11:59:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:42] (03CR) 10Filippo Giunchedi: "I noticed thanos-be2001 filling up root FS with logs again, turns out (in retrospect obviously) that we were banning based on container na" [puppet] - 10https://gerrit.wikimedia.org/r/809966 (https://phabricator.wikimedia.org/T297959) (owner: 10Filippo Giunchedi) [12:02:10] 10SRE, 10SRE Observability, 10User-fgiunchedi: systemd state on thanos-fe1001 is flapping - https://phabricator.wikimedia.org/T311322 (10fgiunchedi) I think I 
tracked the root cause down, an entry for `thanos-swift.discovery.wmnet` pointing to codfw was present in `/etc/hosts`. TBH I can't remember if I did... [12:02:53] (03CR) 10Filippo Giunchedi: [C: 03+1] "I believe this is good to be merged now!" [puppet] - 10https://gerrit.wikimedia.org/r/793817 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [12:10:33] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/809943 (owner: 10Slyngshede) [12:11:27] (03PS1) 10Ssingh: bird: add validate_cmd for bird.conf [puppet] - 10https://gerrit.wikimedia.org/r/809968 [12:12:06] (03CR) 10CI reject: [V: 04-1] bird: add validate_cmd for bird.conf [puppet] - 10https://gerrit.wikimedia.org/r/809968 (owner: 10Ssingh) [12:12:08] (03CR) 10Muehlenhoff: [C: 03+2] Add Paul Norman to contributors [puppet] - 10https://gerrit.wikimedia.org/r/809629 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [12:12:56] (03PS2) 10Ssingh: bird: add validate_cmd for bird.conf [puppet] - 10https://gerrit.wikimedia.org/r/809968 [12:14:10] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36143/console" [puppet] - 10https://gerrit.wikimedia.org/r/809968 (owner: 10Ssingh) [12:14:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1128 (re)pooling @ 75%: Maint done', diff saved to https://phabricator.wikimedia.org/P30659 and previous config saved to /var/cache/conftool/dbconfig/20220630-121427-ladsgroup.json [12:14:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:41] (03CR) 10Ssingh: bird: add validate_cmd for bird.conf [puppet] - 10https://gerrit.wikimedia.org/r/809968 (owner: 10Ssingh) [12:15:09] 10SRE, 10DC-Ops, 10Patch-For-Review: Confirm support of PERC 750 raid controller - https://phabricator.wikimedia.org/T297913 (10Papaul) [12:15:32] (03PS1) 10Slyngshede: P:aptrepo::wikimedia move private repo to nginx and uninstall apache 
[puppet] - 10https://gerrit.wikimedia.org/r/809969 [12:16:32] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (10Papaul) [12:17:07] (03CR) 10Urbanecm: "code looks good, i just want to highlight an approval required by a comment (can't find a sign of approval on the patch or task)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/809165 (https://phabricator.wikimedia.org/T311143) (owner: 10Sbisson) [12:17:26] (03CR) 10Urbanecm: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806960 (https://phabricator.wikimedia.org/T311015) (owner: 10DDesouza) [12:18:57] (03CR) 10CI reject: [V: 04-1] QuickSurveys: Deploy research-incentive to jawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806960 (https://phabricator.wikimedia.org/T311015) (owner: 10DDesouza) [12:20:59] (03PS2) 10Slyngshede: P:puppet:agent run puppet agent one minute after startup. [puppet] - 10https://gerrit.wikimedia.org/r/809943 [12:23:45] (03PS2) 10Raymond Ndibe: Modify maintain-dbusers.py to call the rest-api service [puppet] - 10https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) [12:23:47] (03PS55) 10Raymond Ndibe: Create REST api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) [12:25:39] (03CR) 10Raymond Ndibe: Create REST api service to manage toolforge replica.my.cnf (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [12:26:14] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36144/console" [puppet] - 10https://gerrit.wikimedia.org/r/809943 (owner: 10Slyngshede) [12:26:20] (03CR) 10Ssingh: [V: 03+1] "(removed +1 by mistake that PCC added)" [puppet] - 
10https://gerrit.wikimedia.org/r/809968 (owner: 10Ssingh) [12:29:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1128 (re)pooling @ 100%: Maint done', diff saved to https://phabricator.wikimedia.org/P30660 and previous config saved to /var/cache/conftool/dbconfig/20220630-122931-ladsgroup.json [12:29:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:29:42] (03CR) 10David Caro: [C: 03+2] Create REST api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [12:30:04] (03CR) 10David Caro: Create REST api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [12:31:18] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 (10MoritzMuehlenhoff) [12:31:43] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/809943 (owner: 10Slyngshede) [12:31:59] (03PS56) 10Raymond Ndibe: Create REST api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) [12:32:01] (03PS3) 10Raymond Ndibe: Modify maintain-dbusers.py to call the rest-api service [puppet] - 10https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) [12:32:42] (03CR) 10David Caro: [C: 03+2] Create REST api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [12:35:14] (03PS11) 10DDesouza: QuickSurveys: Deploy research-incentive to jawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806960 (https://phabricator.wikimedia.org/T311015) [12:36:36] (03PS1) 10Marostegui: mariadb: Remove db2083 [puppet] - 10https://gerrit.wikimedia.org/r/809975 
(https://phabricator.wikimedia.org/T311695) [12:36:56] !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission for hosts db2083.codfw.wmnet [12:36:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:35] (03CR) 10DDesouza: "Fixed spaces being used for indentation." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806960 (https://phabricator.wikimedia.org/T311015) (owner: 10DDesouza) [12:39:09] (03CR) 10Marostegui: [C: 03+2] mariadb: Remove db2083 [puppet] - 10https://gerrit.wikimedia.org/r/809975 (https://phabricator.wikimedia.org/T311695) (owner: 10Marostegui) [12:39:45] (03PS1) 10Papaul: ADD new PDU model to ps1-a4-codfw [puppet] - 10https://gerrit.wikimedia.org/r/809977 (https://phabricator.wikimedia.org/T309957) [12:40:43] !log marostegui@cumin1001 START - Cookbook sre.dns.netbox [12:40:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:27] (03CR) 10KartikMistry: Enable Wikistories on idwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/809165 (https://phabricator.wikimedia.org/T311143) (owner: 10Sbisson) [12:42:13] James_F: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/809165 - can you take a look when you're available. 
[12:44:36] RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:44:51] !log marostegui@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:44:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:27] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db2083.codfw.wmnet [12:46:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:28] 10ops-codfw, 10decommission-hardware, 10Patch-For-Review: decommission db2083 - https://phabricator.wikimedia.org/T311695 (10Marostegui) a:03Papaul [12:47:39] 10ops-codfw, 10decommission-hardware, 10Patch-For-Review: decommission db2083 - https://phabricator.wikimedia.org/T311695 (10Marostegui) @Papaul this is ready! [12:47:47] 10ops-codfw, 10decommission-hardware, 10Patch-For-Review: decommission db2083 - https://phabricator.wikimedia.org/T311695 (10Marostegui) [12:49:15] 10ops-codfw, 10decommission-hardware, 10Patch-For-Review: decommission db2083 - https://phabricator.wikimedia.org/T311695 (10Papaul) [12:53:05] (03CR) 10Slyngshede: [V: 03+1 C: 03+2] P:puppet:agent run puppet agent one minute after startup. 
[puppet] - 10https://gerrit.wikimedia.org/r/809943 (owner: 10Slyngshede) [12:54:22] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:55:35] (03PS1) 10Volans: sre.hosts.reimage: fix --no-pxe puppet behaviour [cookbooks] - 10https://gerrit.wikimedia.org/r/809983 [12:58:39] (03CR) 10CI reject: [V: 04-1] sre.hosts.reimage: fix --no-pxe puppet behaviour [cookbooks] - 10https://gerrit.wikimedia.org/r/809983 (owner: 10Volans) [13:00:05] Deploy window Mobileapps/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220630T1300) [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, and awight: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220630T1300). [13:00:05] koi and kostajh: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:29] hi [13:00:33] !log uploaded php-defaults 76+wmf1+buster2 for component/php74 (drops a Breaks: on php72-common) T311386 [13:00:39] hi! 
i can deploy today [13:00:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:41] T311386: Install php 7.4 in production - https://phabricator.wikimedia.org/T311386 [13:01:49] (03CR) 10Urbanecm: [C: 03+2] Fixes Content sub unreadable in Vector 22 [skins/Vector] (wmf/1.39.0-wmf.18) - 10https://gerrit.wikimedia.org/r/809890 (https://phabricator.wikimedia.org/T311564) (owner: 10Stang) [13:01:59] koi: I'll let you know once this is ready to be tested [13:02:01] hi, i'm here [13:02:04] hi kostajh [13:02:10] got it, thanks [13:02:26] (03PS5) 10Urbanecm: Structured task: enable free text for "other" rejection reason [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807576 (https://phabricator.wikimedia.org/T304099) (owner: 10MewOphaswongse) [13:02:31] (03CR) 10Urbanecm: [C: 03+2] Structured task: enable free text for "other" rejection reason [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807576 (https://phabricator.wikimedia.org/T304099) (owner: 10MewOphaswongse) [13:04:23] (03Merged) 10jenkins-bot: Structured task: enable free text for "other" rejection reason [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807576 (https://phabricator.wikimedia.org/T304099) (owner: 10MewOphaswongse) [13:04:45] kostajh: pulled to mwdebug1001, can you have a look? 
[13:04:50] urbanecm: yep, one moment [13:04:50] (03PS2) 10Volans: sre.hosts.reimage: fix --no-pxe puppet behaviour [cookbooks] - 10https://gerrit.wikimedia.org/r/809983 [13:05:48] !log upgrade mwdebug* servers to 2:76+wmf1~buster2 T311386 [13:05:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:54] T311386: Install php 7.4 in production - https://phabricator.wikimedia.org/T311386 [13:07:17] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:07:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:49] urbanecm: lgtm [13:08:10] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:08:11] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:08:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:57] kostajh: thanks, syncing [13:09:08] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:09:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:13] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host stat1009.mgmt.eqiad.wmnet with reboot policy FORCED [13:09:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:27] (03CR) 10Slyngshede: [C: 03+1] "LGTM,PCI ID looks correct: https://pci-ids.ucw.cz/read/PC/1000/10e2" [puppet] - 10https://gerrit.wikimedia.org/r/809913 (https://phabricator.wikimedia.org/T297913) (owner: 10Muehlenhoff) [13:11:03] (03CR) 10Volans: [C: 03+2] sre.hosts.reimage: fix --no-pxe puppet behaviour [cookbooks] - 10https://gerrit.wikimedia.org/r/809983 (owner: 10Volans) [13:12:12] (03PS1) 10David Caro: wmcs.puppet_alert: properly check if files exist [puppet] - 10https://gerrit.wikimedia.org/r/809987 [13:12:28] 
(03CR) 10Alexandros Kosiaris: [C: 03+2] Allow mwbuilder group to access mwdeploy key [puppet] - 10https://gerrit.wikimedia.org/r/809712 (https://phabricator.wikimedia.org/T310395) (owner: 10Ahmon Dancy) [13:13:07] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: fb399065b123db536ae244a0c0fada61eb906a6e: Structured task: enable free text for "other" rejection reason (T304099) (duration: 03m 46s) [13:13:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:13] T304099: Structured tasks: temporary free text for "other" rejection reason - https://phabricator.wikimedia.org/T304099 [13:13:51] (03Merged) 10jenkins-bot: sre.hosts.reimage: fix --no-pxe puppet behaviour [cookbooks] - 10https://gerrit.wikimedia.org/r/809983 (owner: 10Volans) [13:14:10] kostajh: and, should be live. anything else? [13:14:12] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:14:13] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10BTullis) Hi @Cmjohnson - that's really interesting. I think that you're one step closer to a working system than I am, but ultimately I think... [13:14:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:24] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36145/console" [puppet] - 10https://gerrit.wikimedia.org/r/809969 (owner: 10Slyngshede) [13:14:30] urbanecm: that's all, thank you [13:14:35] no problem :) [13:14:49] kostajh: fwiw i don't see the other box at cswiki, but i guess that's because it's on wmf.17? 
[13:15:03] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:15:04] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:15:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:04] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:16:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:39] (03CR) 10Raymond Ndibe: [C: 03+1] wmcs.puppet_alert: properly check if files exist [puppet] - 10https://gerrit.wikimedia.org/r/809987 (owner: 10David Caro) [13:16:58] !log installing libsndfile security updates [13:17:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:46] (03Merged) 10jenkins-bot: Fixes Content sub unreadable in Vector 22 [skins/Vector] (wmf/1.39.0-wmf.18) - 10https://gerrit.wikimedia.org/r/809890 (https://phabricator.wikimedia.org/T311564) (owner: 10Stang) [13:18:54] (03CR) 10Slavina Stefanova: [C: 03+1] wmcs.puppet_alert: properly check if files exist [puppet] - 10https://gerrit.wikimedia.org/r/809987 (owner: 10David Caro) [13:19:09] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db2155.codfw.wmnet with OS bullseye [13:19:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:15] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db2155.codfw.wmnet with OS bullseye [13:19:52] (03CR) 10David Caro: [C: 03+2] "Thanks both!" 
[puppet] - 10https://gerrit.wikimedia.org/r/809987 (owner: 10David Caro) [13:19:57] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2155.codfw.wmnet with reason: host reimage [13:20:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:01] koi: your patch is at mwdebug1001, can you check? [13:21:06] looking [13:21:54] urbanecm: LGTM [13:21:58] thanks, syncing [13:22:41] !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b3fe77c]: (no justification provided) [13:22:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:50] !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b3fe77c]: (no justification provided) (duration: 00m 09s) [13:22:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:17] !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b3fe77c]: (no justification provided) [13:23:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:25] !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b3fe77c]: (no justification provided) (duration: 00m 08s) [13:23:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:51] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2155.codfw.wmnet with reason: host reimage [13:23:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:08] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:24:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:57] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:24:58] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:25:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:05] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:49] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:25:51] !log urbanecm@deploy1002 Synchronized php-1.39.0-wmf.18/skins/Vector/resources/skins.vector.styles/layouts/screen.less: a927e6fbf56f031c42737cd9710eb0531bab43e1: Fixes Content sub unreadable in Vector 22 (T311564) (duration: 03m 18s) [13:25:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:59] T311564: Content sub unreadable in Vector 22 - https://phabricator.wikimedia.org/T311564 [13:26:01] koi: and it's live. anything else? [13:26:04] PROBLEM - PHP7 rendering on mwdebug2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:26:08] !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b3fe77c]: (no justification provided) [13:26:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:16] !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b3fe77c]: (no justification provided) (duration: 00m 08s) [13:26:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:54] urbanecm: correct, the config patch requires wmf.18 to take effect [13:27:07] urbanecm: one question, is the backport window a proper place to schedule a maintenance script run? [13:27:45] koi: on that, you need to sync with someone who can run them. 
if it's a simple one, i can run it now :) [13:28:03] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host stat1009.eqiad.wmnet with OS buster [13:28:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:10] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q3:(Need By: TBD) rack/setup/install stat1009 - https://phabricator.wikimedia.org/T299466 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host stat1009.eqiad.wmnet with OS buster [13:29:30] urbanecm: cool! would you like to have a look at T311012? [13:29:31] T311012: Attach account "New user message" to its global account - https://phabricator.wikimedia.org/T311012 [13:29:41] !log slyngshede@cumin1001 START - Cookbook sre.hosts.reboot-single for host sretest1001.eqiad.wmnet [13:29:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:27] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host poolcounter2004.codfw.wmnet [13:31:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:31] !log [urbanecm@mwmaint1002 /srv/mediawiki/php]$ mwscript extensions/CentralAuth/maintenance/attachAccount.php --wiki=rowiki --userlist /home/urbanecm/users.txt # T311012, users.txt has `New user message` only [13:32:34] koi: here you go :) [13:32:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:26] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:33:29] thanks a lot :) [13:33:57] !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host stat1009.eqiad.wmnet with OS buster [13:34:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:02] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q3:(Need By: TBD) rack/setup/install stat1009 - 
https://phabricator.wikimedia.org/T299466 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host stat1009.eqiad.wmnet with OS buster executed with errors: - stat100... [13:34:03] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2155.codfw.wmnet with OS bullseye [13:34:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:09] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db2155.codfw.wmnet with OS bullseye completed: - db2... [13:34:21] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host stat1009.eqiad.wmnet with OS bullseye [13:34:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:26] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q3:(Need By: TBD) rack/setup/install stat1009 - https://phabricator.wikimedia.org/T299466 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host stat1009.eqiad.wmnet with OS bullseye [13:34:53] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1001.eqiad.wmnet [13:34:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:09] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host poolcounter2004.codfw.wmnet [13:35:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:19] !log run `CentralAuthUser::importLocalNames` for `MediaWiki message delivery` (T275935) [13:35:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:25] T275935: Please manually attach new MassMessage system accounts on Wikimedia wikis - 
https://phabricator.wikimedia.org/T275935 [13:35:57] !log [urbanecm@mwmaint1002 /srv/mediawiki/php]$ mwscript extensions/CentralAuth/maintenance/attachAccount.php --wiki=rowiki --userlist /home/urbanecm/users.txt # T275935, users.txt has `MediaWiki message delivery` only [13:36:02] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host poolcounter2003.codfw.wmnet [13:36:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:12] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1128.eqiad.wmnet with reason: Maintenance [13:36:15] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1128.eqiad.wmnet with reason: Maintenance [13:36:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1128 (T309311)', diff saved to https://phabricator.wikimedia.org/P30661 and previous config saved to /var/cache/conftool/dbconfig/20220630-133619-ladsgroup.json [13:36:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:26] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [13:37:26] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1175.eqiad.wmnet with reason: Maintenance [13:37:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:39] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1175.eqiad.wmnet with reason: Maintenance [13:37:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1175 (T307525)', diff saved to 
https://phabricator.wikimedia.org/P30662 and previous config saved to /var/cache/conftool/dbconfig/20220630-133743-ladsgroup.json [13:37:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:49] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [13:38:25] !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b3fe77c]: (no justification provided) [13:38:28] !log bmansurov@deploy1002 deploy aborted: (no justification provided) (duration: 00m 03s) [13:38:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:41] !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b3fe77c]: (no justification provided) [13:38:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:50] !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b3fe77c]: (no justification provided) (duration: 00m 08s) [13:38:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:02] !log run `CentralAuthUser::importLocalNames` for `New user message` (T311012) [13:39:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:08] T311012: Attach account "New user message" to its global account - https://phabricator.wikimedia.org/T311012 [13:39:38] !log [urbanecm@mwmaint1002 /srv/mediawiki/php]$ mwscript extensions/CentralAuth/maintenance/attachAccount.php --wiki=metawiki --userlist /home/urbanecm/users.txt # T311012, users.txt has `New user message` only [13:39:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host poolcounter2003.codfw.wmnet [13:39:51] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:05] !log killed refreshLinkRecommendations.php on arzwiki (T299021) [13:40:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:12] T299021: Shorten running time of refreshLinkRecommendations.php - https://phabricator.wikimedia.org/T299021 [13:40:20] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_dump_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:41:18] (03CR) 10Jgiannelos: [C: 03+1] swift: heavier banhammer for tegola object-server 'access logs' [puppet] - 10https://gerrit.wikimedia.org/r/809966 (https://phabricator.wikimedia.org/T297959) (owner: 10Filippo Giunchedi) [13:42:29] (03PS1) 10Muehlenhoff: Depool poolcounter1005 for reboot [mediawiki-config] - 10https://gerrit.wikimedia.org/r/809990 [13:43:20] RECOVERY - PHP7 rendering on mwdebug2001 is OK: HTTP OK: HTTP/1.1 302 Found - 564 bytes in 0.134 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:47:43] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db2154.codfw.wmnet with OS bullseye [13:47:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:54] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db2154.codfw.wmnet with OS bullseye [13:48:29] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2154.codfw.wmnet with reason: host reimage [13:48:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:36] !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host stat1009.eqiad.wmnet with OS bullseye [13:48:40] Logged the message 
at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:40] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q3:(Need By: TBD) rack/setup/install stat1009 - https://phabricator.wikimedia.org/T299466 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host stat1009.eqiad.wmnet with OS bullseye executed with errors: - stat1... [13:48:51] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q3:(Need By: TBD) rack/setup/install stat1009 - https://phabricator.wikimedia.org/T299466 (10Cmjohnson) [13:49:12] (03PS3) 10Giuseppe Lavagetto: mediawiki: install php7.4 on the canaries [puppet] - 10https://gerrit.wikimedia.org/r/808909 (https://phabricator.wikimedia.org/T311386) [13:49:40] PROBLEM - Check systemd state on poolcounter2003 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens5.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:52:24] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2154.codfw.wmnet with reason: host reimage [13:52:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:03] RECOVERY - High average GET latency for mw requests on appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET [13:53:34] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q3:(Need By: TBD) rack/setup/install stat1009 - https://phabricator.wikimedia.org/T299466 (10Cmjohnson) @BTullis @RobH @Papaul I set the raid up so the raid 1 ssds were first and used the install script for buster. Buster fails to see the disks, so I... 
[13:54:49] (03CR) 10Ladsgroup: [C: 03+1] build: Remove redundant defines.php includes from CI build scripts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807604 (owner: 10Krinkle) [13:54:53] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q3:(Need By: TBD) rack/setup/install stat1009 - https://phabricator.wikimedia.org/T299466 (10Cmjohnson) @BTullis just read your response on an-presto and see that you're experiencing this with stat1010. Thank you for digging into it more. [13:55:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T307525)', diff saved to https://phabricator.wikimedia.org/P30663 and previous config saved to /var/cache/conftool/dbconfig/20220630-135509-ladsgroup.json [13:55:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:16] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [13:55:28] (03CR) 10Ladsgroup: [C: 03+1] build: Make config gen signature for prod compatible with test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/807605 (owner: 10Krinkle) [13:55:36] !log installing firejail security updates on stretch [13:55:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:05] (03CR) 10Ladsgroup: [C: 03+1] noc: Add wiki.php to view a given wiki configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799352 (https://phabricator.wikimedia.org/T308932) (owner: 10Ladsgroup) [13:56:09] (03CR) 10Btullis: Assign new password to Cassandra superuser (031 comment) [labs/private] - 10https://gerrit.wikimedia.org/r/809639 (https://phabricator.wikimedia.org/T311652) (owner: 10Eevans) [13:56:37] (03CR) 10Majavah: "The API takes YAML not JSON, so this should probably use the generic `data` parameter instead." 
[puppet] - 10https://gerrit.wikimedia.org/r/809721 (https://phabricator.wikimedia.org/T274666) (owner: 10Andrew Bogott) [14:00:54] 10SRE, 10Infrastructure-Foundations, 10CAS-SSO: Enable webauthn in CAS to replace U2F - https://phabricator.wikimedia.org/T311236 (10MoritzMuehlenhoff) p:05Triage→03Medium [14:01:06] 10SRE, 10Cassandra: Allow Cassandra to be deployed on Bullseye nodes - https://phabricator.wikimedia.org/T310980 (10MoritzMuehlenhoff) p:05Triage→03Medium [14:02:21] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2154.codfw.wmnet with OS bullseye [14:02:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:27] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db2154.codfw.wmnet with OS bullseye completed: - db2... [14:02:57] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q3:(Need By: TBD) rack/setup/install stat1009 - https://phabricator.wikimedia.org/T299466 (10BTullis) Thanks @Cmjohnson - yes I think that this is very likely to be the same issue. That's useful that you've experienced exactly the same outcome on this as I... 
[14:05:58] (03PS3) 10Eevans: Assign new password to Cassandra superuser [labs/private] - 10https://gerrit.wikimedia.org/r/809639 (https://phabricator.wikimedia.org/T311652) [14:09:30] (03PS1) 10Btullis: Add a hiera alias for the cassandra superuser password to AQS [puppet] - 10https://gerrit.wikimedia.org/r/809996 (https://phabricator.wikimedia.org/T311652) [14:10:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P30664 and previous config saved to /var/cache/conftool/dbconfig/20220630-141014-ladsgroup.json [14:10:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:57] (03CR) 10Btullis: "I think that to get a pcc run working we will need to merge this: https://gerrit.wikimedia.org/r/c/labs/private/+/809639" [puppet] - 10https://gerrit.wikimedia.org/r/809996 (https://phabricator.wikimedia.org/T311652) (owner: 10Btullis) [14:13:11] (03CR) 10Tchanders: [C: 04-1] "-1 just for the config name, now we've merged the other patch" [deployment-charts] - 10https://gerrit.wikimedia.org/r/808923 (https://phabricator.wikimedia.org/T310646) (owner: 10Hnowlan) [14:13:28] (03CR) 10Filippo Giunchedi: [C: 03+2] swift: heavier banhammer for tegola object-server 'access logs' [puppet] - 10https://gerrit.wikimedia.org/r/809966 (https://phabricator.wikimedia.org/T297959) (owner: 10Filippo Giunchedi) [14:14:03] (03CR) 10Giuseppe Lavagetto: [C: 03+1] icinga: remove 'monitoring' from service::catalog [puppet] - 10https://gerrit.wikimedia.org/r/793817 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [14:14:36] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - 
https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [14:14:53] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db2160.codfw.wmnet with OS bullseye [14:14:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:59] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db2160.codfw.wmnet with OS bullseye [14:15:46] (03CR) 10David Caro: [C: 03+2] P:openstack::puppetmaster: alert for puppet certs for deleted instances [puppet] - 10https://gerrit.wikimedia.org/r/806433 (owner: 10Majavah) [14:17:20] (03CR) 10David Caro: [C: 03+2] P:(toolforge|wmcs::paws)::prometheus: improve namespace filtering [puppet] - 10https://gerrit.wikimedia.org/r/807562 (owner: 10Majavah) [14:19:27] 10SRE, 10SRE Observability, 10User-fgiunchedi: systemd state on thanos-fe1001 is flapping - https://phabricator.wikimedia.org/T311322 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi Optimistically resolving because I haven't seen any more failures! 
[14:20:00] (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+1] "PCC SUCCESS (DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36148/console" [puppet] - 10https://gerrit.wikimedia.org/r/793817 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [14:20:16] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db2156.mgmt.codfw.wmnet with reboot policy FORCED [14:20:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:51] (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] "PCC's happy on alert1001 + lvs1019, merging" [puppet] - 10https://gerrit.wikimedia.org/r/793817 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [14:22:41] !log pt1979@cumin2002 END (ERROR) - Cookbook sre.hosts.provision (exit_code=97) for host db2156.mgmt.codfw.wmnet with reboot policy FORCED [14:22:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:01] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db2156.mgmt.codfw.wmnet with reboot policy FORCED [14:23:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:16] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [14:25:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P30665 and previous config saved to /var/cache/conftool/dbconfig/20220630-142519-ladsgroup.json [14:25:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:15] (03PS5) 10Filippo Giunchedi: keyholder: Collect Prometheus metrics [puppet] - 10https://gerrit.wikimedia.org/r/787911 (owner: 10Majavah) [14:29:28] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:29:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[14:29:56] (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] icinga: remove 'monitoring' from service::catalog (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/793817 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [14:30:00] (03CR) 10CI reject: [V: 04-1] keyholder: Collect Prometheus metrics [puppet] - 10https://gerrit.wikimedia.org/r/787911 (owner: 10Majavah) [14:30:26] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q3:(Need By: TBD) rack/setup/install stat1009 - https://phabricator.wikimedia.org/T299466 (10Ottomata) We will have to rebuild hadoop for bullsye, eh? {T310643} [14:32:07] RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:32:14] 10SRE, 10ops-codfw, 10decommission-hardware: decommission db2083 - https://phabricator.wikimedia.org/T311695 (10Papaul) [14:32:23] (03PS6) 10Filippo Giunchedi: keyholder: Collect Prometheus metrics [puppet] - 10https://gerrit.wikimedia.org/r/787911 (owner: 10Majavah) [14:33:30] 10SRE, 10ops-codfw, 10decommission-hardware: decommission db2083 - https://phabricator.wikimedia.org/T311695 (10Papaul) 05Open→03Resolved complete [14:34:31] (03CR) 10Filippo Giunchedi: "I think this is now superseded by the blackbox http check" [puppet] - 10https://gerrit.wikimedia.org/r/786365 (owner: 10Jbond) [14:34:36] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2160.codfw.wmnet with reason: host reimage [14:34:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128 (T309311)', diff saved to https://phabricator.wikimedia.org/P30666 and previous config saved to /var/cache/conftool/dbconfig/20220630-143436-ladsgroup.json [14:34:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:46] T309311: Make user_editcount unsigned in 
production - https://phabricator.wikimedia.org/T309311 [14:34:46] (03CR) 10Filippo Giunchedi: [C: 03+1] "I think this is now superseded by the blackbox http check" [puppet] - 10https://gerrit.wikimedia.org/r/773272 (https://phabricator.wikimedia.org/T304321) (owner: 10Jbond) [14:35:12] (03CR) 10Filippo Giunchedi: [C: 03+2] keyholder: Collect Prometheus metrics [puppet] - 10https://gerrit.wikimedia.org/r/787911 (owner: 10Majavah) [14:37:10] (03PS3) 10Cwhite: loki: add ferm service to control api access [puppet] - 10https://gerrit.wikimedia.org/r/809709 (https://phabricator.wikimedia.org/T222826) [14:37:37] (03CR) 10Cwhite: loki: add ferm service to control api access (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/809709 (https://phabricator.wikimedia.org/T222826) (owner: 10Cwhite) [14:38:07] <_joe_> !log updating python-poolcounter to 0.0.2 across the fleet [14:38:07] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2160.codfw.wmnet with reason: host reimage [14:38:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T307525)', diff saved to https://phabricator.wikimedia.org/P30667 and previous config saved to /var/cache/conftool/dbconfig/20220630-144024-ladsgroup.json [14:40:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:31] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [14:41:11] !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b3fe77c]: (no justification provided) [14:41:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:19] !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b3fe77c]: (no 
justification provided) (duration: 00m 08s) [14:41:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:56] (03CR) 10Filippo Giunchedi: "See inline, LGTM otherwise" [puppet] - 10https://gerrit.wikimedia.org/r/809709 (https://phabricator.wikimedia.org/T222826) (owner: 10Cwhite) [14:41:57] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1175.eqiad.wmnet with reason: Maintenance [14:42:00] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1175.eqiad.wmnet with reason: Maintenance [14:42:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1175 (T307525)', diff saved to https://phabricator.wikimedia.org/P30668 and previous config saved to /var/cache/conftool/dbconfig/20220630-144204-ladsgroup.json [14:42:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:35] !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b3fe77c]: (no justification provided) [14:42:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:43] !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b3fe77c]: (no justification provided) (duration: 00m 08s) [14:42:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:04] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2156.mgmt.codfw.wmnet with reboot policy FORCED [14:46:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:44] (03PS5) 10Cwhite: logstash: duplicate alert logs for loki target [puppet] - 10https://gerrit.wikimedia.org/r/806349 (https://phabricator.wikimedia.org/T222826) [14:46:50] (03CR) 10Cwhite: logstash: duplicate alert logs for loki target (032 comments) 
[puppet] - 10https://gerrit.wikimedia.org/r/806349 (https://phabricator.wikimedia.org/T222826) (owner: 10Cwhite) [14:48:16] (03PS4) 10Cwhite: loki: add ferm service to control api access [puppet] - 10https://gerrit.wikimedia.org/r/809709 (https://phabricator.wikimedia.org/T222826) [14:48:47] (03CR) 10Cwhite: loki: add ferm service to control api access (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/809709 (https://phabricator.wikimedia.org/T222826) (owner: 10Cwhite) [14:49:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128', diff saved to https://phabricator.wikimedia.org/P30669 and previous config saved to /var/cache/conftool/dbconfig/20220630-144940-ladsgroup.json [14:49:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:42] (03PS1) 10Majavah: add keyholder alerting [alerts] - 10https://gerrit.wikimedia.org/r/810003 [14:52:00] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host thumbor1001.eqiad.wmnet [14:52:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:24] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2160.codfw.wmnet with OS bullseye [14:52:25] RECOVERY - Checks that the local airflow scheduler for airflow @research is working properly on an-airflow1002 is OK: OK: /usr/bin/env AIRFLOW_HOME=/srv/airflow-research /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1002.eqiad.wmnet succeeded https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow [14:52:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:30] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db2160.codfw.wmnet with OS bullseye completed: - db2... 
[14:54:20] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db2156.codfw.wmnet with OS bullseye [14:54:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:26] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db2156.codfw.wmnet with OS bullseye [14:54:31] (03CR) 10Herron: [C: 03+1] loki: add ferm service to control api access [puppet] - 10https://gerrit.wikimedia.org/r/809709 (https://phabricator.wikimedia.org/T222826) (owner: 10Cwhite) [14:56:59] !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b3fe77c]: (no justification provided) [14:57:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:09] !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b3fe77c]: (no justification provided) (duration: 00m 10s) [14:57:10] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (10Papaul) [14:57:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:36] 10SRE, 10Cassandra: Allow Cassandra to be deployed on Bullseye nodes - https://phabricator.wikimedia.org/T310980 (10Eevans) >>! In T310980#8039322, @elukey wrote: > I checked in the jira that was pointed out earlier, and I noticed two things: > > 1) Most of the subtasks are related to finding how to test thin... 
[14:58:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T307525)', diff saved to https://phabricator.wikimedia.org/P30670 and previous config saved to /var/cache/conftool/dbconfig/20220630-145820-ladsgroup.json [14:58:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:27] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [14:58:34] (03CR) 10Filippo Giunchedi: [C: 03+1] loki: add ferm service to control api access [puppet] - 10https://gerrit.wikimedia.org/r/809709 (https://phabricator.wikimedia.org/T222826) (owner: 10Cwhite) [14:59:06] (03CR) 10Filippo Giunchedi: [C: 03+1] logstash: duplicate alert logs for loki target [puppet] - 10https://gerrit.wikimedia.org/r/806349 (https://phabricator.wikimedia.org/T222826) (owner: 10Cwhite) [14:59:45] (03PS1) 10Elukey: Add a new Eventgate stream for revision-score events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810007 (https://phabricator.wikimedia.org/T301878) [14:59:54] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (10Papaul) @Marostegui you can start putting the first 8 in service if you want. leave db2156 for now I am still doing install on it . I had some iss... 
[15:00:34] (03CR) 10CI reject: [V: 04-1] Add a new Eventgate stream for revision-score events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810007 (https://phabricator.wikimedia.org/T301878) (owner: 10Elukey) [15:00:39] (03PS2) 10Elukey: Add a new Eventgate stream for revision-score events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810007 (https://phabricator.wikimedia.org/T301878) [15:02:39] (03PS21) 10Volans: sre.network.configure-switch-interfaces: new [cookbooks] - 10https://gerrit.wikimedia.org/r/803261 (owner: 10Ayounsi) [15:02:56] o/ eamedina47 [15:03:06] (03CR) 10Volans: [C: 03+1] "I've done a full pass and did also some minor adjustment. It looks good to me to start testing it live." [cookbooks] - 10https://gerrit.wikimedia.org/r/803261 (owner: 10Ayounsi) [15:03:43] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (10Marostegui) @Papaul sounds good, so it'd be: 53, 54, 55, 57, 58, 59, 60 for now, right? 
[15:03:49] (03CR) 10Volans: [C: 03+1] sre.network.configure-switch-interfaces: new [cookbooks] - 10https://gerrit.wikimedia.org/r/803261 (owner: 10Ayounsi) [15:03:57] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host thumbor1001.eqiad.wmnet [15:04:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128', diff saved to https://phabricator.wikimedia.org/P30671 and previous config saved to /var/cache/conftool/dbconfig/20220630-150445-ladsgroup.json [15:04:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:59] !log ongoing PDU maintenance in Rack A4 CODFW [15:05:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:40] (03PS22) 10Volans: sre.network.configure-switch-interfaces: new [cookbooks] - 10https://gerrit.wikimedia.org/r/803261 (owner: 10Ayounsi) [15:07:59] (03CR) 10Volans: [C: 03+1] sre.network.configure-switch-interfaces: new (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/803261 (owner: 10Ayounsi) [15:09:25] (03CR) 10Cwhite: [C: 03+2] loki: add ferm service to control api access [puppet] - 10https://gerrit.wikimedia.org/r/809709 (https://phabricator.wikimedia.org/T222826) (owner: 10Cwhite) [15:09:42] (03CR) 10Klausman: [C: 03+1] Add a new Eventgate stream for revision-score events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810007 (https://phabricator.wikimedia.org/T301878) (owner: 10Elukey) [15:10:13] (03PS1) 10Muehlenhoff: Add Alex Monk to contributors [puppet] - 10https://gerrit.wikimedia.org/r/810011 (https://phabricator.wikimedia.org/T308013) [15:11:41] 10SRE, 10Data-Engineering, 10Event-Platform, 10serviceops: eventstreams chart should use latest common_templates - https://phabricator.wikimedia.org/T310721 (10JArguello-WMF) [15:11:45] PROBLEM - Host ps1-a4-codfw is DOWN: PING CRITICAL - Packet loss = 100% 
[15:11:53] 10SRE, 10Data-Engineering-Kanban, 10Event-Platform, 10serviceops, and 2 others: eventgate chart should use common_templates - https://phabricator.wikimedia.org/T303543 (10JArguello-WMF) [15:12:15] (03PS2) 10Cwhite: logstash: add loki output support [puppet] - 10https://gerrit.wikimedia.org/r/809722 (https://phabricator.wikimedia.org/T222826) [15:13:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P30672 and previous config saved to /var/cache/conftool/dbconfig/20220630-151325-ladsgroup.json [15:13:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:11] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2156.codfw.wmnet with reason: host reimage [15:14:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:52] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Kanban, and 2 others: Q4: rack/setup/install stat1010 - https://phabricator.wikimedia.org/T307399 (10JArguello-WMF) [15:16:09] 10SRE, 10Data-Engineering-Kanban, 10Traffic, 10Data Engineering Planning: Spike: Investigate creating robust alerts to notify that caching nodes are not sending traffic data - https://phabricator.wikimedia.org/T304651 (10JArguello-WMF) [15:16:55] (03CR) 10Muehlenhoff: [C: 03+2] Add Alex Monk to contributors [puppet] - 10https://gerrit.wikimedia.org/r/810011 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [15:17:44] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2156.codfw.wmnet with reason: host reimage [15:17:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128 (T309311)', diff saved to https://phabricator.wikimedia.org/P30673 and previous config saved to /var/cache/conftool/dbconfig/20220630-151951-ladsgroup.json 
[15:19:52] (03PS1) 10Muehlenhoff: Drop references to puppet source files [puppet] - 10https://gerrit.wikimedia.org/r/810014 [15:19:53] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance [15:19:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:57] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [15:20:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:06] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance [15:20:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:10] (03PS3) 10Cwhite: logstash: add loki output support [puppet] - 10https://gerrit.wikimedia.org/r/809722 (https://phabricator.wikimedia.org/T222826) [15:22:14] (03PS2) 10Muehlenhoff: Drop references to puppet source files [puppet] - 10https://gerrit.wikimedia.org/r/810014 [15:23:26] PROBLEM - Host cp2027 is DOWN: PING CRITICAL - Packet loss = 100% [15:23:38] 10SRE, 10Cassandra: Allow Cassandra to be deployed on Bullseye nodes - https://phabricator.wikimedia.org/T310980 (10elukey) >>! In T310980#8040624, @Eevans wrote: > I would propose that the way to think about this might be to ask ourselves how much runway we want/need from here to 4.x. 3.11.x is [[ https://ca... 
[15:23:44] PROBLEM - Host kubemaster2001 is DOWN: PING CRITICAL - Packet loss = 100% [15:23:44] PROBLEM - Host ms-be2060 is DOWN: PING CRITICAL - Packet loss = 100% [15:23:44] PROBLEM - Host ms-be2062 is DOWN: PING CRITICAL - Packet loss = 100% [15:23:44] PROBLEM - Host mw2251 is DOWN: PING CRITICAL - Packet loss = 100% [15:23:44] PROBLEM - Host mw2252 is DOWN: PING CRITICAL - Packet loss = 100% [15:23:44] PROBLEM - Host mw2253 is DOWN: PING CRITICAL - Packet loss = 100% [15:23:56] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:23:58] PROBLEM - Host ores2001 is DOWN: PING CRITICAL - Packet loss = 100% [15:24:02] PROBLEM - Host ms-be2066 is DOWN: PING CRITICAL - Packet loss = 100% [15:24:04] mmmmm [15:24:05] uh... [15:24:08] hmm [15:24:09] looks like a rack failure [15:24:12] PROBLEM - Host people2002 is DOWN: PING CRITICAL - Packet loss = 100% [15:24:13] let's check [15:24:16] papaul: ^ [15:24:24] PROBLEM - Host cp2028 is DOWN: PING CRITICAL - Packet loss = 100% [15:24:24] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [15:24:24] PROBLEM - Host ganeti2027 is DOWN: PING CRITICAL - Packet loss = 100% [15:24:30] PROBLEM - Host logstash2033 is DOWN: PING CRITICAL - Packet loss = 100% [15:24:33] In case there's something going on there [15:24:50] PROBLEM - Host kafka-main2001 is DOWN: PING CRITICAL - Packet loss = 100% [15:24:53] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on stat1010.eqiad.wmnet with reason: host reimage [15:24:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:00] PROBLEM - Host backup2002 is DOWN: PING CRITICAL - Packet loss = 100% [15:25:05] rack A4 [15:25:09] the rack should be A4 https://netbox.wikimedia.org/dcim/racks/46/ [15:25:12] PROBLEM - Checks that the local airflow scheduler for airflow @research is working properly on an-airflow1002 is CRITICAL: CRITICAL: /usr/bin/env AIRFLOW_HOME=/srv/airflow-research /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1002.eqiad.wmnet did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow [15:25:12] PROBLEM - Host backup2004 is DOWN: PING CRITICAL - Packet loss = 100% [15:25:19] yeah, maybe the PDU work marostegui ? 
[15:25:20] jynus: ^ [15:25:22] PROBLEM - Host mc-gp2001 is DOWN: PING CRITICAL - Packet loss = 100% [15:25:25] ongoing PDU maintenance in Rack A4 CODFW [15:25:30] :-( [15:25:30] PROBLEM - Host dbprov2001 is DOWN: PING CRITICAL - Packet loss = 100% [15:25:37] not a big deal, it was idle [15:25:38] ahhhh thanks zabe [15:25:38] PROBLEM - Host ncredir2001 is DOWN: PING CRITICAL - Packet loss = 100% [15:25:42] PROBLEM - Host kafkamon2002 is DOWN: PING CRITICAL - Packet loss = 100% [15:25:44] PROBLEM - Host logstash2026 is DOWN: PING CRITICAL - Packet loss = 100% [15:25:44] PROBLEM - Host orespoolcounter2003 is DOWN: PING CRITICAL - Packet loss = 100% [15:25:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [15:25:56] it is networking ? [15:27:00] checking ores2001 [15:27:10] RECOVERY - Host cp2027 is UP: PING WARNING - Packet loss = 90%, RTA = 31.57 ms [15:27:10] RECOVERY - Host ganeti2027 is UP: PING OK - Packet loss = 0%, RTA = 31.62 ms [15:27:12] RECOVERY - Host people2002 is UP: PING OK - Packet loss = 0%, RTA = 31.78 ms [15:27:12] RECOVERY - Host logstash2026 is UP: PING OK - Packet loss = 0%, RTA = 33.00 ms [15:27:12] RECOVERY - Host cp2028 is UP: PING OK - Packet loss = 0%, RTA = 31.57 ms [15:27:12] RECOVERY - Host ncredir2001 is UP: PING OK - Packet loss = 0%, RTA = 31.84 ms [15:27:12] RECOVERY - Host backup2002 is UP: PING OK - Packet loss = 0%, RTA = 31.68 ms [15:27:13] RECOVERY - Host logstash2033 is UP: PING OK - Packet loss = 0%, RTA = 31.64 ms [15:27:14] RECOVERY - Host kafkamon2002 is UP: PING OK - Packet loss = 0%, RTA = 36.22 ms [15:27:14] RECOVERY - Host kubemaster2001 is UP: PING OK - Packet loss = 0%, RTA = 38.62 ms [15:27:14] RECOVERY - Host mc-gp2001 is UP: PING OK - Packet loss = 0%, RTA = 31.62 ms 
[15:27:16] RECOVERY - Host mw2252 is UP: PING OK - Packet loss = 0%, RTA = 31.64 ms [15:27:16] RECOVERY - Host ores2001 is UP: PING OK - Packet loss = 0%, RTA = 31.76 ms [15:27:16] RECOVERY - Host kafka-main2001 is UP: PING OK - Packet loss = 0%, RTA = 31.69 ms [15:27:18] RECOVERY - Host orespoolcounter2003 is UP: PING OK - Packet loss = 0%, RTA = 32.05 ms [15:27:25] Like a phoenix [15:27:29] elukey: let us know based on uptime [15:27:31] (03PS2) 10Muehlenhoff: ores: Assign SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/809625 (https://phabricator.wikimedia.org/T308013) [15:27:37] if user impact I will start an incident [15:27:49] jynus: yeah seems so, I can access the OS but the network is not available [15:27:56] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Kanban, and 2 others: Q4: rack/setup/install stat1010 - https://phabricator.wikimedia.org/T307399 (10JArguello-WMF) [15:28:01] so network only [15:28:02] RECOVERY - Host ms-be2062 is UP: PING OK - Packet loss = 0%, RTA = 31.60 ms [15:28:04] RECOVERY - Host mw2253 is UP: PING OK - Packet loss = 0%, RTA = 31.68 ms [15:28:08] RECOVERY - Host mw2251 is UP: PING OK - Packet loss = 0%, RTA = 31.59 ms [15:28:09] now it works :D [15:28:10] RECOVERY - Host dbprov2001 is UP: PING OK - Packet loss = 0%, RTA = 31.63 ms [15:28:24] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on stat1010.eqiad.wmnet with reason: host reimage [15:28:25] elukey: confirms uptime > 5 minutes, right ? 
[15:28:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:30] RECOVERY - Host ms-be2060 is UP: PING OK - Packet loss = 0%, RTA = 32.08 ms [15:28:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P30674 and previous config saved to /var/cache/conftool/dbconfig/20220630-152830-ladsgroup.json [15:28:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:40] jynus: nope 58 days (ores2001) [15:28:42] s [15:28:44] RECOVERY - Host backup2004 is UP: PING OK - Packet loss = 0%, RTA = 31.63 ms [15:28:44] good [15:28:52] anyone can see user impact? [15:29:03] I will be looking at graphs and logs [15:29:14] RECOVERY - Host ms-be2066 is UP: PING OK - Packet loss = 0%, RTA = 32.72 ms [15:29:17] app servers and dbs shouldn't be impacted, but other services are active [15:29:19] 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Seen): replace doc1001.eqiad.wmnet with a buster VM and create the codfw equivalent - https://phabricator.wikimedia.org/T247653 (10akosiaris) [15:29:38] in theory we should be ok [15:29:44] (03CR) 10Filippo Giunchedi: [C: 03+1] Drop references to puppet source files [puppet] - 10https://gerrit.wikimedia.org/r/810014 (owner: 10Muehlenhoff) [15:29:47] I saw a spike of 5XX [15:30:10] although it is not very large [15:30:31] https://grafana.wikimedia.org/d/myRmf1Pik/varnish-aggregate-client-status-codes?orgId=1&var-site=codfw&var-cache_type=varnish-text&var-cache_type=varnish-upload&var-status_type=5&var-method=GET&from=1656602583597&to=1656602995047&viewPanel=1 [15:30:33] (03CR) 10Muehlenhoff: [C: 03+2] ores: Assign SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/809625 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [15:30:40] PROBLEM - Host logstash2026 is DOWN: PING CRITICAL - Packet loss = 100% [15:31:04] PROBLEM - 
Host logstash2033.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:31:26] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2156.codfw.wmnet with OS bullseye [15:31:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:31] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db2156.codfw.wmnet with OS bullseye completed: - db2... [15:31:36] the services that should be active should have enough redundancy [15:31:38] PROBLEM - IPMI Sensor Status on cp2028 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:31:41] 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Seen): replace doc1001.eqiad.wmnet with a buster VM and create the codfw equivalent - https://phabricator.wikimedia.org/T247653 (10akosiaris) As pointed out in T311732 (now merged as duplicate of... [15:31:51] except maybe ganeti (people?) 
[15:32:25] I think that all VMs down have redundancy [15:32:38] at least judging from a quick glance [15:32:57] (03PS1) 10Majavah: P:toolforge: drop stretch support [puppet] - 10https://gerrit.wikimedia.org/r/810022 [15:33:05] please anyone shout if you see anything still bad since 15:23 [15:33:05] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [15:33:17] (Emergency syslog message) firing: Alert for device asw-a-codfw.mgmt.codfw.wmnet - Emergency syslog message - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message [15:33:46] logstash fallout probably [15:34:02] well, logging pipeline in general [15:34:09] jynus: I am talking with Papaul on #dcops too [15:34:46] 10SRE, 10Data-Engineering-Kanban, 10Traffic, 10Data Engineering Planning (Sprint 01): Spike: Investigate creating robust alerts to notify that caching nodes are not sending traffic data - https://phabricator.wikimedia.org/T304651 (10JArguello-WMF) [15:36:03] (03PS1) 10Cwhite: logstash: increase dlq replicas to one [puppet] - 10https://gerrit.wikimedia.org/r/810026 (https://phabricator.wikimedia.org/T311740) [15:36:19] (03PS1) 10Giuseppe Lavagetto: mediawiki: add scap restarts script [puppet] - 10https://gerrit.wikimedia.org/r/810027 [15:36:32] PROBLEM - Host logstash2026.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:36:38] RECOVERY - Host logstash2033.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.78 ms [15:36:51] (JobUnavailable) firing: Reduced availability for job es_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:37:47] (Emergency syslog message) resolved: 
Device asw-a-codfw.mgmt.codfw.wmnet recovered from Emergency syslog message - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message [15:37:49] asw-a4-codfw 3:37PM up 13 mins, [15:37:56] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin1001 is CRITICAL: CRITICAL: the following (31) node(s) change every puppet run: aqs2001, aqs2002, aqs2003, aqs2004, aqs2005, aqs2006, aqs2007, aqs2008, aqs2009, aqs2010, aqs2011, aqs2012, clouddumps1001, clouddumps1002, cloudservices1003, cloudservices1004, db2156, gitlab1001, gitlab1004, gitlab2001, ms-fe1010, ms-fe1011, ms-fe1012, ms-fe2010, ms-fe2011, ms-fe2 [15:37:56] nos-fe1002, thanos-fe1003, thanos-fe2001, thanos-fe2002, thanos-fe2003 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes [15:38:18] mmm, 31 hosts is a lot of hosts [15:38:38] is that another fallout of the power issue, maybe? [15:38:39] (03CR) 10Filippo Giunchedi: "LGTM overall, thank you! See inline" [alerts] - 10https://gerrit.wikimedia.org/r/810003 (owner: 10Majavah) [15:38:44] jynus: a rack is usually 40 devices, and that's without counting VMs [15:38:55] no, I mean for the puppet alert [15:38:58] (KubernetesRsyslogDown) firing: rsyslog on kubemaster2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [15:38:59] ah [15:39:07] probably puppet failed and triggered the alert [15:39:13] hopefully it will recover [15:39:23] some of those hosts are at eqiad [15:39:26] jynus: some hosts are in eqiad [15:39:27] eh [15:39:36] PROBLEM - IPMI Sensor Status on logstash2033 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:39:38] yeah, but those may be actual errors [15:39:59] some of the codfw ones shouldn't alert 
(like db2*) [15:40:20] PROBLEM - IPMI Sensor Status on mc-gp2001 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:41:12] RECOVERY - Host logstash2026 is UP: PING OK - Packet loss = 0%, RTA = 31.63 ms [15:41:48] I think that was the last host to come up [15:42:11] let's wait for maintenance to complete [15:42:23] and then we will review if there is any outstanding issue left [15:42:24] RECOVERY - Host logstash2026.mgmt is UP: PING OK - Packet loss = 0%, RTA = 35.08 ms [15:43:34] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host stat1010.eqiad.wmnet with OS bullseye [15:43:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T307525)', diff saved to https://phabricator.wikimedia.org/P30675 and previous config saved to /var/cache/conftool/dbconfig/20220630-154335-ladsgroup.json [15:43:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:43] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Kanban, and 2 others: Q4: rack/setup/install stat1010 - https://phabricator.wikimedia.org/T307399 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host stat1010.eqiad.wmnet with OS bullseye completed: - stat1010 (*...
[15:43:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:45] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [15:44:44] (03CR) 10Filippo Giunchedi: add keyholder alerting (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/810003 (owner: 10Majavah) [15:45:19] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (10Cmjohnson) @nskaggs can you confirm the partman recipe you want? [15:45:45] (JobUnavailable) resolved: Reduced availability for job es_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:46:03] (03PS12) 10Vgutierrez: [WIP] esitest service [puppet] - 10https://gerrit.wikimedia.org/r/793561 (https://phabricator.wikimedia.org/T308799) (owner: 10BBlack) [15:46:05] (03PS1) 10Vgutierrez: trafficserver: Add ESI testing remap rule [puppet] - 10https://gerrit.wikimedia.org/r/810030 (https://phabricator.wikimedia.org/T308799) [15:46:24] (03PS2) 10Giuseppe Lavagetto: mediawiki: add scap restarts script [puppet] - 10https://gerrit.wikimedia.org/r/810027 [15:46:26] (03PS1) 10Giuseppe Lavagetto: scap: use the new script to restart php-fpm [puppet] - 10https://gerrit.wikimedia.org/r/810031 [15:47:38] (03PS2) 10Majavah: add keyholder alerting [alerts] - 10https://gerrit.wikimedia.org/r/810003 [15:49:15] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (10Papaul) @Marostegui you can do all first 8 [15:49:25] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 3): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36151/console" [puppet] - 10https://gerrit.wikimedia.org/r/810022 (owner: 10Majavah) [15:50:14] (03CR) 10Filippo Giunchedi: [C: 03+1] logstash: increase dlq replicas to one [puppet] - 10https://gerrit.wikimedia.org/r/810026 (https://phabricator.wikimedia.org/T311740) (owner: 10Cwhite) [15:50:33] 10Puppet, 10Infrastructure-Foundations, 10puppet-compiler: pcc-uploader failing on tools-puppetmaster-02 - https://phabricator.wikimedia.org/T311742 (10taavi) [15:52:16] (03CR) 10Majavah: add keyholder alerting (032 comments) [alerts] - 10https://gerrit.wikimedia.org/r/810003 (owner: 10Majavah) [15:52:56] (03PS1) 10Zabe: acme_chief: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/810032 (https://phabricator.wikimedia.org/T308013) [15:52:58] (03PS1) 10Zabe: certspotter: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/810033 (https://phabricator.wikimedia.org/T308013) [15:53:00] (03PS1) 10Zabe: cumin: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/810034 (https://phabricator.wikimedia.org/T308013) [15:53:02] (03PS1) 10Zabe: pdns_server: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/810035 (https://phabricator.wikimedia.org/T308013) [15:53:04] (03PS1) 10Zabe: uwsgi: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/810036 (https://phabricator.wikimedia.org/T308013) [15:53:36] (03PS1) 10Jcrespo: InteractiveQuery: Add additional cli messages after user testing [software/mediabackups] - 10https://gerrit.wikimedia.org/r/810037 (https://phabricator.wikimedia.org/T311215) [15:53:52] (03CR) 10Herron: [C: 03+1] "SGTM" [puppet] - 10https://gerrit.wikimedia.org/r/810026 (https://phabricator.wikimedia.org/T311740) (owner: 10Cwhite) [15:54:37] (03CR) 10Filippo Giunchedi: [C: 03+1] add keyholder alerting [alerts] - 10https://gerrit.wikimedia.org/r/810003 (owner: 10Majavah) [15:54:39] (03CR) 10Filippo Giunchedi: [C: 03+2] add keyholder alerting 
[alerts] - 10https://gerrit.wikimedia.org/r/810003 (owner: 10Majavah) [15:54:45] (03CR) 10Papaul: [C: 03+2] ADD new PDU model to ps1-a4-codfw [puppet] - 10https://gerrit.wikimedia.org/r/809977 (https://phabricator.wikimedia.org/T309957) (owner: 10Papaul) [15:55:52] (03CR) 10Cwhite: [C: 03+2] logstash: increase dlq replicas to one [puppet] - 10https://gerrit.wikimedia.org/r/810026 (https://phabricator.wikimedia.org/T311740) (owner: 10Cwhite) [15:56:17] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/810026 (https://phabricator.wikimedia.org/T311740) (owner: 10Cwhite) [15:57:43] (03Merged) 10jenkins-bot: add keyholder alerting [alerts] - 10https://gerrit.wikimedia.org/r/810003 (owner: 10Majavah) [15:58:09] (03CR) 10Filippo Giunchedi: [C: 03+1] "Just a nit inline, LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/809722 (https://phabricator.wikimedia.org/T222826) (owner: 10Cwhite) [15:58:12] papaul: going to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/809977/ [15:59:03] (03CR) 10Jcrespo: "FYI" [software/mediabackups] - 10https://gerrit.wikimedia.org/r/810037 (https://phabricator.wikimedia.org/T311215) (owner: 10Jcrespo) [15:59:07] cwhite: yes thanks [16:00:04] jbond and rzl: That opportune time is upon us again. Time for a Puppet request window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220630T1600). [16:00:04] No Gerrit patches in the queue for this window AFAICS. 
[16:00:18] RECOVERY - Host ps1-a4-codfw is UP: PING OK - Packet loss = 0%, RTA = 33.75 ms [16:00:22] (03PS4) 10Cwhite: logstash: add loki output support [puppet] - 10https://gerrit.wikimedia.org/r/809722 (https://phabricator.wikimedia.org/T222826) [16:00:25] (03PS1) 10Majavah: keyholder::monitoring: remove nrpe check [puppet] - 10https://gerrit.wikimedia.org/r/810039 [16:00:27] (03PS1) 10Majavah: keyholder::monitoring: drop nrpe plugin [puppet] - 10https://gerrit.wikimedia.org/r/810040 [16:00:29] (03PS1) 10Majavah: keyholder::monitoring: drop absented resources [puppet] - 10https://gerrit.wikimedia.org/r/810041 [16:00:50] (03CR) 10Cwhite: logstash: add loki output support (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/809722 (https://phabricator.wikimedia.org/T222826) (owner: 10Cwhite) [16:01:36] RECOVERY - IPMI Sensor Status on cp2028 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [16:02:51] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (10Marostegui) Brilliant! Thanks [16:06:28] 10SRE, 10DC-Ops, 10Patch-For-Review: Confirm support of PERC 750 raid controller - https://phabricator.wikimedia.org/T297913 (10RobH) a:05RobH→03MoritzMuehlenhoff So I think this is now on Mortiz to roll out the monitoring changes (as he is in the above patchset) and no longer blocked on my testing. I'm... 
[16:07:22] (03PS1) 10Vgutierrez: varnish: Enable ESI for /esitest-fa8a495983347898/includer [puppet] - 10https://gerrit.wikimedia.org/r/810044 (https://phabricator.wikimedia.org/T308799) [16:07:25] 10SRE, 10DC-Ops, 10Patch-For-Review: Confirm support of PERC 750 raid controller - https://phabricator.wikimedia.org/T297913 (10BTullis) There was one more issue to address with these servers, which (thanks once again to @fgiunchedi) we have now identified and overcome. It was related to the enumeration/ord... [16:08:13] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1184.eqiad.wmnet with reason: Maintenance [16:08:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:27] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1184.eqiad.wmnet with reason: Maintenance [16:08:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1184 (T309311)', diff saved to https://phabricator.wikimedia.org/P30676 and previous config saved to /var/cache/conftool/dbconfig/20220630-160831-ladsgroup.json [16:08:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:37] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [16:09:39] RECOVERY - IPMI Sensor Status on logstash2033 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [16:10:13] (03CR) 10Andrea Denisse: [C: 03+1] "Looks good to me, thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/809722 (https://phabricator.wikimedia.org/T222826) (owner: 10Cwhite) [16:10:26] RECOVERY - IPMI Sensor Status on mc-gp2001 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [16:11:52] (03PS2) 10Jcrespo: InteractiveQuery: Add additional cli messages after user testing [software/mediabackups] - 10https://gerrit.wikimedia.org/r/810037 (https://phabricator.wikimedia.org/T311215) [16:12:25] (03PS1) 10Jdlrobson: Vector: Deploy title above tabs to all opt-in wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810045 (https://phabricator.wikimedia.org/T310054) [16:12:27] (03PS1) 10Jdlrobson: Enable Vector grid on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810046 (https://phabricator.wikimedia.org/T303484) [16:12:45] (Device rebooted) firing: Alert for device ps1-a4-codfw.mgmt.codfw.wmnet - Device rebooted - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted [16:13:41] 10SRE, 10ops-codfw, 10Patch-For-Review: (Need By:TBD) rack/setup/install row A new PDUs - https://phabricator.wikimedia.org/T309957 (10Papaul) [16:14:04] 10SRE, 10ops-codfw: codfw: Master PDU rack/setup row A, row B, rowC and row D task - https://phabricator.wikimedia.org/T309956 (10Papaul) [16:14:48] PROBLEM - SSH on restbase2018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [16:15:55] (03PS3) 10Giuseppe Lavagetto: mediawiki: add scap restarts script [puppet] - 10https://gerrit.wikimedia.org/r/810027 [16:15:57] (03PS2) 10Giuseppe Lavagetto: scap: use the new script to restart php-fpm [puppet] - 10https://gerrit.wikimedia.org/r/810031 [16:15:59] (03PS1) 10Giuseppe Lavagetto: scap: drop unused parameters from the configuration [puppet] - 10https://gerrit.wikimedia.org/r/810048 [16:17:26] PROBLEM - Restbase root url on restbase2018 is CRITICAL: connect 
to address 10.192.48.120 and port 7231: Connection refused https://wikitech.wikimedia.org/wiki/RESTBase [16:17:36] (03CR) 10Herron: [C: 03+1] "LGTM, please see optional comment inline" [puppet] - 10https://gerrit.wikimedia.org/r/809722 (https://phabricator.wikimedia.org/T222826) (owner: 10Cwhite) [16:17:45] (Device rebooted) resolved: Device ps1-a4-codfw.mgmt.codfw.wmnet recovered from Device rebooted - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted [16:20:16] PROBLEM - cassandra-a CQL 10.192.48.124:9042 on restbase2018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886 [16:20:19] (03CR) 10CI reject: [V: 04-1] mediawiki: add scap restarts script [puppet] - 10https://gerrit.wikimedia.org/r/810027 (owner: 10Giuseppe Lavagetto) [16:21:46] (03PS8) 10Vlad.shapik: Upgrade thumbor to Thumbor 7 and python3 [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/800170 (https://phabricator.wikimedia.org/T252719) [16:24:13] (03CR) 10CI reject: [V: 04-1] Upgrade thumbor to Thumbor 7 and python3 [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/800170 (https://phabricator.wikimedia.org/T252719) (owner: 10Vlad.shapik) [16:27:35] 10SRE, 10Traffic, 10Patch-For-Review: Test ESI feasibility with current Varnish installation - https://phabricator.wikimedia.org/T308799 (10Vgutierrez) @AndyRussG currently in our CDN varnish and ATS runs on the same nodes. All the communication with backend servers/applayer is performed by ats-be (see https... 
[16:28:30] (03PS9) 10Vlad.shapik: Upgrade thumbor to Thumbor 7 and python3 [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/800170 (https://phabricator.wikimedia.org/T252719) [16:28:32] PROBLEM - cassandra-b CQL 10.192.48.125:9042 on restbase2018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886 [16:28:38] !log volans@cumin1001 START - Cookbook sre.dns.netbox [16:28:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:17] (03CR) 10CI reject: [V: 04-1] Upgrade thumbor to Thumbor 7 and python3 [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/800170 (https://phabricator.wikimedia.org/T252719) (owner: 10Vlad.shapik) [16:30:23] 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: Q3: rack/setup/install dumpsdata100[67] - https://phabricator.wikimedia.org/T299443 (10RobH) Summarizing yesterday's work: * Rob updated the BIOS (to latest) and the idrac (one step below latest, latest breaks https idrac interface) - NIC still doesn't det... 
[16:31:14] (03PS3) 10Jcrespo: InteractiveQuery: Add additional cli messages after user testing [software/mediabackups] - 10https://gerrit.wikimedia.org/r/810037 (https://phabricator.wikimedia.org/T311215) [16:31:20] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin2002 is CRITICAL: CRITICAL: the following (30) node(s) change every puppet run: aqs2001, aqs2002, aqs2003, aqs2004, aqs2005, aqs2006, aqs2007, aqs2008, aqs2009, aqs2010, aqs2011, aqs2012, clouddumps1001, clouddumps1002, cloudservices1003, cloudservices1004, gitlab1001, gitlab1004, gitlab2001, ms-fe1010, ms-fe1011, ms-fe1012, ms-fe2010, ms-fe2011, ms-fe2012, tha [16:31:20] 02, thanos-fe1003, thanos-fe2001, thanos-fe2002, thanos-fe2003 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes [16:32:31] (03CR) 10David Caro: [C: 03+2] P:toolforge::checker: remove stretch endpoints [puppet] - 10https://gerrit.wikimedia.org/r/807170 (https://phabricator.wikimedia.org/T277653) (owner: 10Majavah) [16:32:35] (03CR) 10David Caro: [C: 03+2] "Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/807170 (https://phabricator.wikimedia.org/T277653) (owner: 10Majavah) [16:32:48] (03PS5) 10David Caro: P:toolforge::checker: remove stretch endpoints [puppet] - 10https://gerrit.wikimedia.org/r/807170 (https://phabricator.wikimedia.org/T277653) (owner: 10Majavah) [16:32:54] !log volans@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:32:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:36] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:33:54] PROBLEM - cassandra-c CQL 10.192.48.126:9042 on restbase2018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886 [16:36:14] (03CR) 10David Caro: "Just one question, otherwise LGTM (that's a +1 from me if anyone gets to it before me)" [puppet] - 10https://gerrit.wikimedia.org/r/810022 (owner: 10Majavah) [16:36:22] (03CR) 10Cwhite: [C: 03+2] "PCC noop: https://puppet-compiler.wmflabs.org/pcc-worker1001/36152/" [puppet] - 10https://gerrit.wikimedia.org/r/809722 (https://phabricator.wikimedia.org/T222826) (owner: 10Cwhite) [16:36:32] (03PS4) 10Jcrespo: InteractiveQuery: Add additional cli messages after user testing [software/mediabackups] - 10https://gerrit.wikimedia.org/r/810037 (https://phabricator.wikimedia.org/T311215) [16:37:21] (03PS1) 10Sohom Datta: Enable edit-in-sequence on Beta Wikisource for testing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810054 (https://phabricator.wikimedia.org/T308098) [16:40:01] (03CR) 10Sohom Datta: "Needs to be enabled after https://gerrit.wikimedia.org/r/c/mediawiki/extensions/ProofreadPage/+/806272 is merged 😊" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810054 (https://phabricator.wikimedia.org/T308098) (owner: 10Sohom Datta) [16:40:23] (03CR) 10Majavah: [V: 03+1] P:toolforge: drop stretch support (031 comment) 
[puppet] - 10https://gerrit.wikimedia.org/r/810022 (owner: 10Majavah) [16:40:58] (03PS1) 10Ladsgroup: Set GlobalBlockingAllowedRanges for testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810055 (https://phabricator.wikimedia.org/T307648) [16:41:56] (03PS10) 10Vlad.shapik: Upgrade thumbor to Thumbor 7 and python3 [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/800170 (https://phabricator.wikimedia.org/T252719) [16:42:48] (03CR) 10CI reject: [V: 04-1] Set GlobalBlockingAllowedRanges for testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810055 (https://phabricator.wikimedia.org/T307648) (owner: 10Ladsgroup) [16:43:35] (03CR) 10CI reject: [V: 04-1] Upgrade thumbor to Thumbor 7 and python3 [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/800170 (https://phabricator.wikimedia.org/T252719) (owner: 10Vlad.shapik) [16:43:58] (03PS11) 10Vlad.shapik: Upgrade thumbor to Thumbor 7 and python3 [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/800170 (https://phabricator.wikimedia.org/T252719) [16:44:12] (03PS2) 10Ladsgroup: Set GlobalBlockingAllowedRanges for testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810055 (https://phabricator.wikimedia.org/T307648) [16:45:45] (03CR) 10CI reject: [V: 04-1] Upgrade thumbor to Thumbor 7 and python3 [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/800170 (https://phabricator.wikimedia.org/T252719) (owner: 10Vlad.shapik) [16:46:27] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review, 10cloud-services-team (Kanban): Replace labstore100[67] with clouddumps100[12] - https://phabricator.wikimedia.org/T309346 (10wiki_willy) [16:47:18] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=UPDATE https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [16:47:41] 10SRE, 10Infrastructure-Foundations, 10cloud-services-team 
(Kanban): hdfs client packages for debian Bullseye - https://phabricator.wikimedia.org/T310451 (10wiki_willy) [16:52:10] (03PS1) 10Volans: tools/dump: don't dump cluster groups [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/810056 [16:53:17] (03CR) 10Jcrespo: "Some example inputs, as with color things are clearer, I think:" [software/mediabackups] - 10https://gerrit.wikimedia.org/r/810037 (https://phabricator.wikimedia.org/T311215) (owner: 10Jcrespo) [16:53:32] (03CR) 10Volans: [C: 03+2] "Self-merging to unblock the dumps." [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/810056 (owner: 10Volans) [16:54:25] (03Merged) 10jenkins-bot: tools/dump: don't dump cluster groups [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/810056 (owner: 10Volans) [16:55:07] (03PS12) 10Volans: Add python3.10 support to Tox [cookbooks] - 10https://gerrit.wikimedia.org/r/803263 (owner: 10Ayounsi) [16:55:33] (03PS1) 10MSantos: mobileapps: bump to 2022-06-30-114235-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/810057 [16:56:08] (03PS13) 10Volans: Add python3.10 support to Tox [cookbooks] - 10https://gerrit.wikimedia.org/r/803263 (owner: 10Ayounsi) [16:56:19] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Kanban, and 2 others: Q4: rack/setup/install stat1010 - https://phabricator.wikimedia.org/T307399 (10EChetty) [16:57:20] (03CR) 10Jcrespo: [C: 03+2] cli: Change logging to log on a different file each [software/mediabackups] - 10https://gerrit.wikimedia.org/r/809589 (https://phabricator.wikimedia.org/T311215) (owner: 10Jcrespo) [16:57:29] (03CR) 10Jcrespo: [C: 03+2] Prepare for 0.1.3 release [software/mediabackups] - 10https://gerrit.wikimedia.org/r/809588 (https://phabricator.wikimedia.org/T311215) (owner: 10Jcrespo) [16:57:38] (03CR) 10Jcrespo: [C: 03+2] InteractiveQuery: Add additional cli messages after user testing [software/mediabackups] - 10https://gerrit.wikimedia.org/r/810037 (https://phabricator.wikimedia.org/T311215) 
(owner: 10Jcrespo) [16:59:32] (03CR) 10MSantos: [C: 03+2] mobileapps: bump to 2022-06-30-114235-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/810057 (owner: 10MSantos) [16:59:38] 10SRE, 10ops-eqiad, 10decommission-hardware, 10cloud-services-team (Kanban): decommission cloudvirt1016 - https://phabricator.wikimedia.org/T307825 (10Cmjohnson) [16:59:46] 10SRE, 10ops-eqiad, 10decommission-hardware, 10cloud-services-team (Kanban): decommission cloudvirt1016 - https://phabricator.wikimedia.org/T307825 (10Cmjohnson) 05Open→03Resolved [16:59:51] 10SRE, 10ops-eqiad, 10Cloud-Services, 10Patch-For-Review, 10cloud-services-team (Kanban): rack/setup/install labvirt101[5-8] - https://phabricator.wikimedia.org/T165531 (10Cmjohnson) [17:00:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T309311)', diff saved to https://phabricator.wikimedia.org/P30678 and previous config saved to /var/cache/conftool/dbconfig/20220630-170016-ladsgroup.json [17:00:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:23] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [17:01:40] (03CR) 10Volans: [C: 03+2] "Solved all the issues with python 3.10. Can be finally merged. Thanks Arzhel for the initial patch." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/803263 (owner: 10Ayounsi) [17:03:35] (03Merged) 10jenkins-bot: mobileapps: bump to 2022-06-30-114235-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/810057 (owner: 10MSantos) [17:04:36] (03Merged) 10jenkins-bot: Add python3.10 support to Tox [cookbooks] - 10https://gerrit.wikimedia.org/r/803263 (owner: 10Ayounsi) [17:05:35] 10SRE, 10SRE-swift-storage, 10Data-Persistence-Backup, 10media-backups, and 3 others: WMF media storage must be adequately backed up - https://phabricator.wikimedia.org/T262668 (10jcrespo) [17:06:08] (03PS2) 10Majavah: P:toolforge: drop stretch support [puppet] - 10https://gerrit.wikimedia.org/r/810022 [17:06:26] !log mbsantos@deploy1002 helmfile [staging] START helmfile.d/services/mobileapps: apply [17:06:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:48] !log mbsantos@deploy1002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [17:06:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:54] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36153/console" [puppet] - 10https://gerrit.wikimedia.org/r/810022 (owner: 10Majavah) [17:07:02] !log mbsantos@deploy1002 helmfile [codfw] START helmfile.d/services/mobileapps: apply [17:07:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:07:22] 10SRE, 10ops-eqiad: cloudstore1008 - eno2 reporting no carrier - https://phabricator.wikimedia.org/T309885 (10Cmjohnson) 05Open→03Resolved removed the port [17:07:50] !log mbsantos@deploy1002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [17:07:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:02] RECOVERY - k8s API server requests latencies on kubestagemaster1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [17:09:31] !log mbsantos@deploy1002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [17:09:31] 10SRE, 10SRE-swift-storage, 10Data-Persistence-Backup, 10media-backups, and 3 others: WMF media storage must be adequately backed up - https://phabricator.wikimedia.org/T262668 (10jcrespo) [17:09:33] 10ops-eqiad: Port with no description on access switch - https://phabricator.wikimedia.org/T309741 (10Cmjohnson) there are several new servers that have been racked and tasks have not been updated. These will get updated as soon as possible [17:09:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:42] 10SRE, 10Data-Engineering, 10Data-Engineering-Kanban, 10Traffic: Spike: Investigate creating robust alerts to notify that caching nodes are not sending traffic data - https://phabricator.wikimedia.org/T304651 (10EChetty) [17:09:53] 10SRE, 10Data-Persistence-Backup, 10media-backups, 10Goal, 10Patch-For-Review: Document media recovery use case proposals and decide their priority - https://phabricator.wikimedia.org/T299764 (10jcrespo) 05Open→03Resolved All open questions (or at least basic ones resolved), basically we will do a "b... 
[17:10:12] !log mbsantos@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [17:10:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P30679 and previous config saved to /var/cache/conftool/dbconfig/20220630-171522-ladsgroup.json [17:15:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:20:39] 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: Q3: rack/setup/install dumpsdata100[67] - https://phabricator.wikimedia.org/T299443 (10RobH) a:05Jclark-ctr→03RobH John fixed it, just pinged me in IRC. So I'll steal this back and open a case for the NIC issue. [17:23:08] (03CR) 10BCornwall: [C: 03+1] prometheus: probe DNS for (www).wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/809536 (https://phabricator.wikimedia.org/T169860) (owner: 10Filippo Giunchedi) [17:23:13] (03CR) 10BCornwall: [C: 03+1] prometheus: add initial blackbox dns probes for wikipedia [puppet] - 10https://gerrit.wikimedia.org/r/809535 (https://phabricator.wikimedia.org/T169860) (owner: 10Filippo Giunchedi) [17:25:38] (03PS1) 10Stang: tawikisource: Add English alias for Author/Author_talk namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810060 (https://phabricator.wikimedia.org/T165813) [17:30:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P30680 and previous config saved to /var/cache/conftool/dbconfig/20220630-173027-ladsgroup.json [17:30:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:53] (03CR) 10BCornwall: "Is a wrapper the best way forward for this? 
I'm normally wary of wrappers because of the risk of complexity and changing tooling from stan" [puppet] - 10https://gerrit.wikimedia.org/r/808984 (owner: 10Jbond) [17:35:08] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [17:35:40] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [17:39:36] (03PS2) 10Jdlrobson: Vector: Deploy title above tabs to all opt-in wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810045 (https://phabricator.wikimedia.org/T310054) [17:40:14] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1179.eqiad.wmnet with reason: Maintenance [17:40:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:39] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1179.eqiad.wmnet with reason: Maintenance [17:40:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1179 (T307525)', diff saved to https://phabricator.wikimedia.org/P30681 and previous config saved to /var/cache/conftool/dbconfig/20220630-174043-ladsgroup.json [17:40:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:49] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [17:45:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T309311)', diff saved to https://phabricator.wikimedia.org/P30682 and previous config saved to /var/cache/conftool/dbconfig/20220630-174532-ladsgroup.json [17:45:34] 
!log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1132.eqiad.wmnet with reason: Maintenance [17:45:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:45:39] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [17:45:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:45:59] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1132.eqiad.wmnet with reason: Maintenance [17:46:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1132 (T309311)', diff saved to https://phabricator.wikimedia.org/P30683 and previous config saved to /var/cache/conftool/dbconfig/20220630-174603-ladsgroup.json [17:46:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:32] (03PS1) 10Eigyan: [wmf-config]: Deploy GDI Survey 2 on EN and FA wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810063 (https://phabricator.wikimedia.org/T311759) [17:49:01] (03PS2) 10Eigyan: [wmf-config]: Deploy GDI Survey 2 on EN and FA wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810063 (https://phabricator.wikimedia.org/T311759) [17:50:50] (03PS1) 10Cwhite: beta-logs: set loki retention to 3d [puppet] - 10https://gerrit.wikimedia.org/r/810064 (https://phabricator.wikimedia.org/T222826) [17:52:02] (03CR) 10Cwhite: [C: 03+2] beta-logs: set loki retention to 3d [puppet] - 10https://gerrit.wikimedia.org/r/810064 (https://phabricator.wikimedia.org/T222826) (owner: 10Cwhite) [17:52:06] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:52:56] (03CR) 10David Caro: [C: 03+2] P:toolforge: drop stretch support [puppet] - 10https://gerrit.wikimedia.org/r/810022 (owner: 
10Majavah) [17:54:26] (03PS3) 10Eigyan: [wmf-config]: Deploy GDI Survey 2 on EN and FA wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810063 (https://phabricator.wikimedia.org/T311759) [17:55:35] 10SRE, 10ops-eqiad: SSH on wtp1036.mgmt is flapping (CRITICAL) - https://phabricator.wikimedia.org/T311761 (10ssingh) [17:55:42] 10SRE, 10ops-eqiad: SSH on wtp1036.mgmt is flapping (CRITICAL) - https://phabricator.wikimedia.org/T311761 (10ssingh) p:05Triage→03Low [18:00:05] dduvall and hashar: Your horoscope predicts another unfortunate MediaWiki train - Utc-7+Utc-0 Version deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220630T1800). [18:00:12] 10SRE, 10Cassandra: Allow Cassandra to be deployed on Bullseye nodes - https://phabricator.wikimedia.org/T310980 (10MoritzMuehlenhoff) >>! In T310980#8040825, @elukey wrote: >>>! In T310980#8040624, @Eevans wrote: >> I would propose that the way to think about this might be to ask ourselves how much runway we... 
[18:00:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T307525)', diff saved to https://phabricator.wikimedia.org/P30684 and previous config saved to /var/cache/conftool/dbconfig/20220630-180015-ladsgroup.json [18:00:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:22] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [18:01:11] 10SRE, 10DNS, 10Fundraising-Backlog, 10Infrastructure-Foundations, and 3 others: Consider if to support BIMI for wiki mail - https://phabricator.wikimedia.org/T311685 (10ssingh) p:05Triage→03Medium [18:01:42] 10SRE, 10DSE-Kubernetes-Cluster, 10Infrastructure-Foundations, 10vm-requests: Site: eqiad : 2 VMs requested for DSE Kubernetes Cluster control plane servers - https://phabricator.wikimedia.org/T311133 (10ssingh) p:05Triage→03Medium [18:01:49] 10SRE, 10DSE-Kubernetes-Cluster, 10Infrastructure-Foundations, 10vm-requests: Site: eqiad : 3 VMs requested for Etcd cluster in support of the new DSE Kubernetes cluster - https://phabricator.wikimedia.org/T311131 (10ssingh) p:05Triage→03Medium [18:04:14] (03PS1) 10Majavah: wmcs: k8s: Fix cluster-info parsing [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810067 [18:09:50] (03CR) 10CI reject: [V: 04-1] wmcs: k8s: Fix cluster-info parsing [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810067 (owner: 10Majavah) [18:10:30] (03PS1) 10Dduvall: all wikis to 1.39.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810068 (https://phabricator.wikimedia.org/T308071) [18:10:32] (03CR) 10Dduvall: [C: 03+2] all wikis to 1.39.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810068 (https://phabricator.wikimedia.org/T308071) (owner: 10Dduvall) [18:11:13] (03Merged) 10jenkins-bot: all wikis to 1.39.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810068 
(https://phabricator.wikimedia.org/T308071) (owner: 10Dduvall) [18:14:36] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [18:15:17] !log dduvall@deploy1002 rebuilt and synchronized wikiversions files: all wikis to 1.39.0-wmf.18 refs T308071 [18:15:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P30685 and previous config saved to /var/cache/conftool/dbconfig/20220630-181520-ladsgroup.json [18:15:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:23] T308071: 1.39.0-wmf.18 deployment blockers - https://phabricator.wikimedia.org/T308071 [18:15:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:12] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [18:16:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:13] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [18:20:14] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [18:20:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:14] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [18:21:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:06] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [18:27:37] 10ops-eqiad: Port with no description on access switch - https://phabricator.wikimedia.org/T309741 (10Cmjohnson) 05Open→03Resolved these have been updated with the msw servers [18:30:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P30686 and previous config saved to /var/cache/conftool/dbconfig/20220630-183025-ladsgroup.json [18:30:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:44:46] (03CR) 10BCornwall: spdx: Add csr files to the list of files to ignore. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/808219 (owner: 10Jbond) [18:45:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T307525)', diff saved to https://phabricator.wikimedia.org/P30687 and previous config saved to /var/cache/conftool/dbconfig/20220630-184530-ladsgroup.json [18:45:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:38] T307525: Fix mismatching field type of user table for columns user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T307525 [18:47:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132 (T309311)', diff saved to https://phabricator.wikimedia.org/P30688 and previous config saved to /var/cache/conftool/dbconfig/20220630-184708-ladsgroup.json [18:47:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:14] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [18:48:34] (03CR) 10Majavah: mediawiki: Split updateSpecialPages.php job to be per-shard (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/804788 (https://phabricator.wikimedia.org/T307314) (owner: 10Legoktm) [18:55:17] (03PS3) 
10Dzahn: mediawiki: redirect policy and related sites to wikimediafoundation.org [puppet] - 10https://gerrit.wikimedia.org/r/809324 (https://phabricator.wikimedia.org/T310738) [19:02:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132', diff saved to https://phabricator.wikimedia.org/P30689 and previous config saved to /var/cache/conftool/dbconfig/20220630-190213-ladsgroup.json [19:02:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:02:47] 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: (Need By: TBD) rack/setup/install frdb1005, frdev1003 - https://phabricator.wikimedia.org/T306935 (10Jgreen) [19:05:01] (03CR) 10Dzahn: [C: 03+2] Phabricator: Remove unneeded translation overrides [puppet] - 10https://gerrit.wikimedia.org/r/809907 (https://phabricator.wikimedia.org/T309746) (owner: 10Aklapper) [19:09:44] (03PS1) 10Stang: RecentChange: Straight join to actor table when needed [core] (wmf/1.39.0-wmf.18) - 10https://gerrit.wikimedia.org/r/809959 (https://phabricator.wikimedia.org/T311360) [19:14:58] PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:16:19] (03CR) 10Dzahn: [C: 03+1] admin: allow sudo for jclark-ctr for cookbooks [puppet] - 10https://gerrit.wikimedia.org/r/809338 (https://phabricator.wikimedia.org/T306654) (owner: 10Ssingh) [19:17:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132', diff saved to https://phabricator.wikimedia.org/P30690 and previous config saved to /var/cache/conftool/dbconfig/20220630-191718-ladsgroup.json [19:17:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:28] 10SRE, 10serviceops, 10Wikimedia-production-error: PHP7 corruption reports in 2020-2022 (Call on wrong object, etc.) 
- https://phabricator.wikimedia.org/T245183 (10Arlolra) [19:21:56] (03PS2) 10Andrew Bogott: wmcs-enc-cli.py: fix args passed to requests.post [puppet] - 10https://gerrit.wikimedia.org/r/809721 (https://phabricator.wikimedia.org/T274666) [19:22:38] (03CR) 10CI reject: [V: 04-1] wmcs-enc-cli.py: fix args passed to requests.post [puppet] - 10https://gerrit.wikimedia.org/r/809721 (https://phabricator.wikimedia.org/T274666) (owner: 10Andrew Bogott) [19:24:01] (03PS1) 10Dzahn: httpbb: add tests for policy.wikimedia.org, fixcopyright.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/810073 (https://phabricator.wikimedia.org/T310738) [19:24:58] (03CR) 10Dzahn: "test for this here: https://gerrit.wikimedia.org/r/c/operations/puppet/+/810073" [puppet] - 10https://gerrit.wikimedia.org/r/809324 (https://phabricator.wikimedia.org/T310738) (owner: 10Dzahn) [19:25:29] (03PS3) 10Andrew Bogott: wmcs-enc-cli.py: fix args passed to requests.post [puppet] - 10https://gerrit.wikimedia.org/r/809721 (https://phabricator.wikimedia.org/T274666) [19:27:02] (03CR) 10Ottomata: Add a new Eventgate stream for revision-score events (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810007 (https://phabricator.wikimedia.org/T301878) (owner: 10Elukey) [19:27:45] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (10Papaul) [19:32:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132 (T309311)', diff saved to https://phabricator.wikimedia.org/P30691 and previous config saved to /var/cache/conftool/dbconfig/20220630-193223-ladsgroup.json [19:32:25] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1169.eqiad.wmnet with reason: Maintenance [19:32:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:32:30] T309311: Make user_editcount unsigned in production - 
https://phabricator.wikimedia.org/T309311 [19:32:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:32:50] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1169.eqiad.wmnet with reason: Maintenance [19:32:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:32:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1169 (T309311)', diff saved to https://phabricator.wikimedia.org/P30692 and previous config saved to /var/cache/conftool/dbconfig/20220630-193254-ladsgroup.json [19:33:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:39:13] (KubernetesRsyslogDown) firing: rsyslog on kubemaster2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [19:40:32] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [19:42:48] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [19:53:55] (03CR) 10RLazarus: "> So I am actually not 100% sure which should go first." 
[puppet] - 10https://gerrit.wikimedia.org/r/810073 (https://phabricator.wikimedia.org/T310738) (owner: 10Dzahn) [19:54:34] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:56:03] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810077 (https://phabricator.wikimedia.org/T128546) [19:56:56] (03PS4) 10Dzahn: mediawiki: redirect policy and related sites to wikimediafoundation.org [puppet] - 10https://gerrit.wikimedia.org/r/809324 (https://phabricator.wikimedia.org/T310738) [19:57:50] (03CR) 10Dzahn: httpbb: add tests for policy.wikimedia.org, fixcopyright.wikimedia.org (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/810073 (https://phabricator.wikimedia.org/T310738) (owner: 10Dzahn) [19:57:59] (03Abandoned) 10Dzahn: httpbb: add tests for policy.wikimedia.org, fixcopyright.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/810073 (https://phabricator.wikimedia.org/T310738) (owner: 10Dzahn) [20:00:04] brennen: How many deployers does it take to do UTC late backport and config training deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220630T2000). [20:00:04] danisztls, kart_, Jdlrobson, koi, jan_drewniak, and eigyan: A patch you scheduled for UTC late backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:11] greetings [20:00:18] * kart_ is here [20:00:28] 0/ [20:00:35] o/ [20:00:36] o/ [20:00:45] 0/ ( I can do mine) [20:00:46] Greetings [20:00:58] * kart_ will also self-deploy [20:01:16] * urbanecm waves [20:01:25] o/ [20:01:56] howdy all [20:03:13] hi thcipriani. 
i'm around if any help with the window's needed :) [20:03:22] thanks urbanecm [20:03:29] thcipriani: I think we need to fix comment about Beta feature permission in wmf-config/InitialiseSettings.php#16976 [20:03:32] (03PS12) 10Thcipriani: QuickSurveys: Deploy research-incentive to jawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806960 (https://phabricator.wikimedia.org/T311015) (owner: 10DDesouza) [20:04:33] (03PS1) 10BryanDavis: toolhub: Bump container version to 2022-06-30-170012-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/810079 (https://phabricator.wikimedia.org/T303444) [20:04:47] * TheresNoTime is also around to help if the bottom of the barrel needs scraping :D [20:05:08] That reminds me to fix Beta Feature comment about cx. Last updated in 2019! :) [20:05:36] (03CR) 10Ottomata: [C: 03+2] Update analytics refine job version in test cluster [puppet] - 10https://gerrit.wikimedia.org/r/787718 (owner: 10Aqu) [20:05:47] (03PS1) 10Andrew Bogott: wmcs __init__.py: don't specify json_output when calling run_formatted_as [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810080 [20:05:51] kart_: which patch? [20:06:05] thcipriani: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/809165 [20:07:51] kart_: oh, yes that :D [20:08:04] I'll file a task for that after this [20:08:24] thcipriani: Thanks! 
[20:09:20] kart_ o/ [20:10:05] (03CR) 10CI reject: [V: 04-1] wmcs __init__.py: don't specify json_output when calling run_formatted_as [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810080 (owner: 10Andrew Bogott) [20:11:46] (03CR) 10BryanDavis: [C: 03+2] toolhub: Bump container version to 2022-06-30-170012-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/810079 (https://phabricator.wikimedia.org/T303444) (owner: 10BryanDavis) [20:12:33] (03CR) 10Thcipriani: [C: 03+2] "Approved via scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806960 (https://phabricator.wikimedia.org/T311015) (owner: 10DDesouza) [20:13:50] (03Merged) 10jenkins-bot: QuickSurveys: Deploy research-incentive to jawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806960 (https://phabricator.wikimedia.org/T311015) (owner: 10DDesouza) [20:14:24] !log thcipriani@deploy1002 Started scap: Backport for [[gerrit:806960]] QuickSurveys: Deploy research-incentive to jawiki [20:14:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:40] 10SRE, 10Traffic: pontoon.traffic.eqiad1.wikimedia.cloud unable to run puppet agent due to certificate mismatch - https://phabricator.wikimedia.org/T310303 (10BCornwall) @Vgutierrez Indeed, do you have any reason to keep these *specific* instances around, or are you okay with a replacement? [20:15:01] (03Merged) 10jenkins-bot: toolhub: Bump container version to 2022-06-30-170012-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/810079 (https://phabricator.wikimedia.org/T303444) (owner: 10BryanDavis) [20:15:25] thanks, thcipriani [20:16:14] !log bd808@deploy1002 helmfile [staging] START helmfile.d/services/toolhub: apply [20:16:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:16:27] thcipriani: pardon my ignorance, but i'm curious why a full scap for a config change? :-) is it that quick those days? 
[20:17:08] !log bd808@deploy1002 helmfile [staging] DONE helmfile.d/services/toolhub: apply [20:17:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:23] Oh, I was about to ask when log showed 'started scap..' :) [20:17:39] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:17:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:18:21] !log bd808@deploy1002 helmfile [codfw] START helmfile.d/services/toolhub: apply [20:18:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:18:39] (03CR) 10Jsn.sherman: [C: 03+1] "LGTM! I can see now why we were not producing the expected stay previously. Nice work!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810063 (https://phabricator.wikimedia.org/T311759) (owner: 10Eigyan) [20:18:44] urbanecm: heh, yeah, we're trying to start using a full scap more and we're testing a new scap command (not yet ready for primetime) called "scap backport" so I typed "scap backport 806960" and it merged the change, staged it, and started a sync (although it was supposed to stage it on mwdebug first :D) [20:19:08] oh wait: it did! [20:19:16] interesting initiative :) [20:19:18] (03PS3) 10Sbisson: Enable Wikistories on idwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/809165 (https://phabricator.wikimedia.org/T311143) [20:19:35] danisztls: your change is on mwdebug1002, check please! [20:19:40] thcipriani: that's cool. 
[20:19:50] !log bd808@deploy1002 helmfile [codfw] DONE helmfile.d/services/toolhub: apply [20:19:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:56] * bd808 still dreams of 100% automated, hands-free CD [20:20:00] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [20:20:27] (03PS4) 10Eigyan: [wmf-config]: Deploy GDI Survey 2 on EN and FA wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810063 (https://phabricator.wikimedia.org/T311759) [20:20:53] (03CR) 10KartikMistry: "Relend will update comment about Beta feature permission." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/809165 (https://phabricator.wikimedia.org/T311143) (owner: 10Sbisson) [20:21:08] Typo :/ [20:21:11] bd808: these are steps in that direction [20:21:56] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:21:57] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:22:00] brennen: *nod* mw-on-k8s seems like a good poke to push that way [20:22:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:22:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:22:35] thcipriani: the 'enabled' flag was set to false, can I do a follow-up? [20:23:00] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:23:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:27] !log bd808@deploy1002 helmfile [eqiad] START helmfile.d/services/toolhub: apply [20:23:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:39] (03CR) 10DDesouza: "This change is ready for review." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/809961 (https://phabricator.wikimedia.org/T311015) (owner: 10DDesouza) [20:23:39] danisztls: I suppose although we've got a lot of patches in this window. Is this fine to sync? Or should I revert? [20:24:04] thcipriani: it's fine to sync [20:24:17] thcipriani: it will only be disabled [20:24:25] ah, ok, going live. [20:24:48] (03PS1) 10Dzahn: vtrs: add promtheus blackbox monitoring [puppet] - 10https://gerrit.wikimedia.org/r/810087 [20:24:50] !log bd808@deploy1002 helmfile [eqiad] DONE helmfile.d/services/toolhub: apply [20:24:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:26:30] !log Rebuilding Toolhub search indices [20:26:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:38] (03CR) 10Thcipriani: [C: 03+2] Enable Wikistories on idwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/809165 (https://phabricator.wikimedia.org/T311143) (owner: 10Sbisson) [20:29:25] (03Merged) 10jenkins-bot: Enable Wikistories on idwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/809165 (https://phabricator.wikimedia.org/T311143) (owner: 10Sbisson) [20:30:33] thcipriani: You are deploying https://gerrit.wikimedia.org/r/809165, right? [20:30:44] stephanebisson will test it :) [20:31:02] kart_: cool, I just merged that one, still need to stage it [20:33:09] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:33:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:17] 10SRE, 10serviceops, 10Wikimedia-production-error: PHP7 corruption reports in 2020-2022 (Call on wrong object, etc.) - https://phabricator.wikimedia.org/T245183 (10Krinkle) 05Open→03Resolved a:03Krinkle Any remaining "smells like opcache" problems we see can't be the cause of php-opcache revalidation m... 
[20:35:35] (03CR) 10Thcipriani: [C: 03+2] RecentChange: Straight join to actor table when needed [core] (wmf/1.39.0-wmf.18) - 10https://gerrit.wikimedia.org/r/809959 (https://phabricator.wikimedia.org/T311360) (owner: 10Stang) [20:36:39] thcipriani: FYI my first patch is beta cluster only if you want to hit +2 on that now [20:37:07] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:37:08] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:37:10] Jdlrobson: ah, neat, thanks for the poke that'll help :) [20:37:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:37:14] (03CR) 10RLazarus: "LGTM in principle. One bug in the tests but then it will be ready to go, and the rollout plan in the commit message sounds reasonable." [puppet] - 10https://gerrit.wikimedia.org/r/809324 (https://phabricator.wikimedia.org/T310738) (owner: 10Dzahn) [20:37:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:37:27] (https://gerrit.wikimedia.org/r/c/810046/) [20:37:36] since the window is a bit packed :) [20:37:59] ah, but it looks like it has a dependency chain that gerrit isn't happy about when I try to rebase :\ [20:38:09] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:38:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:38:18] !log thcipriani@deploy1002 Finished scap: Backport for [[gerrit:806960]] QuickSurveys: Deploy research-incentive to jawiki (duration: 23m 53s) [20:38:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:39:21] stephanebisson: your change is on mwdebug1002, check please [20:39:28] thcipriani, kart_: I'm on it, will need a good 5 minutes [20:39:50] thcipriani: ahh my bad [20:40:13] okay.. 
well hopefully the first config change will go quickly [20:40:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T309311)', diff saved to https://phabricator.wikimedia.org/P30693 and previous config saved to /var/cache/conftool/dbconfig/20220630-204043-ladsgroup.json [20:40:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:40:49] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [20:41:40] (03PS3) 10Thcipriani: Vector: Deploy title above tabs to all opt-in wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810045 (https://phabricator.wikimedia.org/T310054) (owner: 10Jdlrobson) [20:41:54] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb={CREATE,UPDATE} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [20:41:54] stephanebisson: no problem. Beta feature part is activated. 
[20:42:12] (03PS3) 10DDesouza: QuickSurveys: Enable 'research-incentive' survey on 'jawiki' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/809961 (https://phabricator.wikimedia.org/T311015) [20:42:58] (03CR) 10Thcipriani: [C: 03+2] Vector: Deploy title above tabs to all opt-in wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810045 (https://phabricator.wikimedia.org/T310054) (owner: 10Jdlrobson) [20:45:04] PROBLEM - k8s API server requests latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 verb={CREATE,UPDATE} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [20:45:24] (03Merged) 10jenkins-bot: Vector: Deploy title above tabs to all opt-in wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810045 (https://phabricator.wikimedia.org/T310054) (owner: 10Jdlrobson) [20:46:19] (03PS6) 10Andrew Bogott: wmcs: vps: create_instance_with_prefix: unbreak [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/802170 (owner: 10Majavah) [20:48:15] (03PS4) 10DDesouza: QuickSurveys: Enable 'research-incentive' survey on 'jawiki' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/809961 (https://phabricator.wikimedia.org/T311015) [20:48:19] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:48:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:49:19] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:49:21] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:49:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:49:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:49:55] Is there someone around with the time to look at restbase2018? It seems...down(ish), and I can't get in via ssh. 
[20:50:20] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:50:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:50:25] To be clear: It's better up than down, but it's not an emergency :) [20:51:33] (03CR) 10CI reject: [V: 04-1] wmcs: vps: create_instance_with_prefix: unbreak [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/802170 (owner: 10Majavah) [20:51:47] The outage length is coming up on the hint window though, so bringing it up reduces the likelihood of any replica loss. [20:52:12] thcipriani all good, please sync [20:55:05] stephanebisson: thanks for checking, going now [20:55:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P30694 and previous config saved to /var/cache/conftool/dbconfig/20220630-205548-ladsgroup.json [20:55:50] PROBLEM - Check systemd state on elastic2027 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:55:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:58:43] (03Merged) 10jenkins-bot: RecentChange: Straight join to actor table when needed [core] (wmf/1.39.0-wmf.18) - 10https://gerrit.wikimedia.org/r/809959 (https://phabricator.wikimedia.org/T311360) (owner: 10Stang) [20:59:05] !log thcipriani@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:809165|Enable Wikistories on idwiki (T311143)]] (duration: 03m 31s) [20:59:06] Jdlrobson: your second change is on mwdebug1002, check please [20:59:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:59:11] T311143: Deploy Wikistories to production - https://phabricator.wikimedia.org/T311143 [20:59:14] stephanebisson: your change should be live now [20:59:26] testing now... 
[20:59:28] (03PS2) 10Thcipriani: Enable Vector grid on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810046 (https://phabricator.wikimedia.org/T303484) (owner: 10Jdlrobson) [20:59:33] (03CR) 10Thcipriani: [C: 03+2] Enable Vector grid on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810046 (https://phabricator.wikimedia.org/T303484) (owner: 10Jdlrobson) [20:59:49] thcipriani Thanks! [21:00:35] thcipriani: it looks like my expression is wrong.. sigh [21:01:35] (03Merged) 10jenkins-bot: Enable Vector grid on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810046 (https://phabricator.wikimedia.org/T303484) (owner: 10Jdlrobson) [21:02:03] bummer :( [21:02:14] koi: your wmf.18 change is on mwdebug1002, check please [21:03:07] looking [21:05:35] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:05:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:06:34] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:06:35] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:06:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:06:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:07:02] thcipriani: pretty sad, the issue still exists, let's revert it [21:07:31] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:07:31] koi: :( ok [21:07:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:07:36] thanks for checking [21:07:45] (03PS1) 10Jdlrobson: Vector: Deploy title above tabs to all opt-in wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810099 (https://phabricator.wikimedia.org/T310054) [21:07:50] thcipriani: i am doing it the old-fashioned way..
[21:08:10] the dblist expressions don't seem to work how I think they work [21:08:16] (03PS1) 10Thcipriani: Revert "RecentChange: Straight join to actor table when needed" [core] (wmf/1.39.0-wmf.18) - 10https://gerrit.wikimedia.org/r/809962 [21:08:25] (03CR) 10Thcipriani: [C: 03+2] Revert "RecentChange: Straight join to actor table when needed" [core] (wmf/1.39.0-wmf.18) - 10https://gerrit.wikimedia.org/r/809962 (owner: 10Thcipriani) [21:09:14] sorry for another twenty minutes waiting 0 0 [21:10:34] koi: damn :/ [21:10:49] koi: no worries we can sync the other stuff while we're waiting on this backport: no big deal <3 [21:10:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P30695 and previous config saved to /var/cache/conftool/dbconfig/20220630-211053-ladsgroup.json [21:10:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:11:05] Jdlrobson: i didn't follow what happened closely, but if it just didn't work at all, i think that's because https://github.com/wikimedia/operations-mediawiki-config/blob/master/multiversion/MWConfigCacheGenerator.php#L27 wasn't updated AFAICS [21:11:57] urbanecm: thanks. This list is small enough I think it's fine to not use a dblist expression. I just regret not doing that from the start now [21:12:38] Jdlrobson: yeah, for sure. enumerating wikis is generally preferred over dblists, so this is certainly better. was just for your information :) [21:12:52] <3 appreciate it urbanecm [21:12:52] thcipriani somehow my patch doesn't seem to be sync'd everywhere. When I refresh, sometimes the code is there sometimes it isn't. Is there a long replication delay or could there be a problem? [21:15:22] There's been trouble with things not fully syncing the last few days [21:15:31] unfortunately :/ [21:16:15] fun [21:16:31] RhinosF1 / urbanecm - know of that being tracked anywhere?
[21:16:56] Jdlrobson: I'm going to revert yours for now so I can clear out the window [21:16:56] stephanebisson: would you mind sharing which mw server works correctly and which one doesn't? should be available in the `server` header in your devtools [21:17:06] brennen: I do not [21:17:10] thcipriani: hang on [21:17:14] urbanecm I'll try... [21:17:19] brennen: not 100% sure. dancy might know, as they helped with debugging it the other day. [21:17:22] thcipriani: mine's time-sensitive so it might be better if it just goes out as is [21:17:33] it's basically enabling it on more wikis than it should [21:17:42] what wikis shouldn't be there? [21:17:46] French Wikipedia [21:18:10] urbanecm mw1414.eqiad.wmnet (not updated) [21:18:14] ideally it wouldn't go out there but the purpose of this change was to get it in front of gadget developer eyes [21:18:17] I don't see that in the dblist file? I see frwikiquote and frwiktionary [21:18:23] yeh that's the problem :) [21:18:24] mw1414 already caused problems the other day [21:18:30] the dblist did the inverse of what i wanted [21:18:47] urbanecm mw1413.eqiad.wmnet (updated) [21:18:57] zabe: that's interesting [21:19:02] stephanebisson: thanks, that's helpful. mw1414 and [21:19:02] I think you are not going to have a clean revert path because of the beta cluster change as well [21:19:10] sorry for the mess :( [21:19:32] Jdlrobson: do you need to flip the false and the true in IS.php then?
[21:19:35] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/810099 [21:19:42] that fixes the problem by getting rid of the dblist altogether [21:20:09] If the window needs to finish and I'm allowed I could see if cjming can help deal with this later today [21:20:26] RECOVERY - Check systemd state on elastic2027 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:20:28] urbanecm mw1369 also not up to date [21:20:38] thanks [21:21:00] urbanecm anything we can do to fully sync? [21:21:04] Hmmm [21:21:25] (03PS2) 10Jdlrobson: Vector: Deploy title above tabs to all opt-in wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810099 (https://phabricator.wikimedia.org/T310054) [21:21:26] yeah. i just want to confirm it's the same thing that was observed yesterday [21:21:30] scap sync-wikiversions [21:21:37] (03PS3) 10Jdlrobson: Vector: Deploy title above tabs to all opt-in wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810099 (https://phabricator.wikimedia.org/T310054) [21:21:43] That's what I used yesterday. [21:21:55] Jdlrobson: i'm around - i think Tyler is reverting your config now [21:22:10] and yes, both mw1414 and mw1413 do have the new code, but the web server didn't pick it up for some reason. [21:22:45] (the other day it was at least mw1414, mw1415, mw1416, mw1417, mw1418, mw1447 and mw1450) [21:22:48] thcipriani: I'm not going to step on your toes too much, so i'll leave the blank scap sync-wikiversions to mitigate the issue to you :). meanwhile i'll phabricatorize it. [21:23:10] Thank you.
[21:23:18] I'm going to go to sleep because I think far more capable hands are here [21:23:24] thanks urbanecm [21:23:28] good night RhinosF1 [21:23:35] Night urbanecm [21:24:06] To me this means that php-fpm restart isn't hitting all of the necessary hosts [21:24:28] or it is, but the script itself doesn't work (also plausible) [21:24:31] (03PS1) 10Thcipriani: Revert "Enable Vector grid on beta cluster" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810101 [21:24:33] (03PS1) 10Thcipriani: Revert "Vector: Deploy title above tabs to all opt-in wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810102 [21:24:34] doesn't *always work [21:24:52] Agreed [21:25:06] (03CR) 10Thcipriani: [C: 03+2] Revert "Enable Vector grid on beta cluster" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810101 (owner: 10Thcipriani) [21:25:11] (03CR) 10Thcipriani: [C: 03+2] Revert "Vector: Deploy title above tabs to all opt-in wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810102 (owner: 10Thcipriani) [21:25:16] Heading back to my desk to investigate [21:25:22] (03Abandoned) 10Andrew Bogott: wmcs __init__.py: don't specify json_output when calling run_formatted_as [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810080 (owner: 10Andrew Bogott) [21:25:50] (03Merged) 10jenkins-bot: Revert "Enable Vector grid on beta cluster" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810101 (owner: 10Thcipriani) [21:25:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T309311)', diff saved to https://phabricator.wikimedia.org/P30696 and previous config saved to /var/cache/conftool/dbconfig/20220630-212558-ladsgroup.json [21:25:59] (03Merged) 10jenkins-bot: Revert "Vector: Deploy title above tabs to all opt-in wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810102 (owner: 10Thcipriani) [21:26:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:26:05] (03PS2) 10Thcipriani: 
tawikisource: Add English alias for Author/Author_talk namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810060 (https://phabricator.wikimedia.org/T165813) (owner: 10Stang) [21:26:05] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [21:26:09] (03CR) 10Thcipriani: [C: 03+2] tawikisource: Add English alias for Author/Author_talk namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810060 (https://phabricator.wikimedia.org/T165813) (owner: 10Stang) [21:26:21] (03PS4) 10Andrew Bogott: wmcs-enc-cli.py: fix args passed to requests.post [puppet] - 10https://gerrit.wikimedia.org/r/809721 (https://phabricator.wikimedia.org/T274666) [21:26:23] (03PS1) 10Andrew Bogott: wmcs-makedomain: forward to python3 [puppet] - 10https://gerrit.wikimedia.org/r/810103 [21:27:01] (03Merged) 10jenkins-bot: tawikisource: Add English alias for Author/Author_talk namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810060 (https://phabricator.wikimedia.org/T165813) (owner: 10Stang) [21:27:33] koi: your wmf-config change is live on mwdebug1002 [21:27:44] (03Merged) 10jenkins-bot: Revert "RecentChange: Straight join to actor table when needed" [core] (wmf/1.39.0-wmf.18) - 10https://gerrit.wikimedia.org/r/809962 (owner: 10Thcipriani) [21:27:50] looking [21:28:24] thcipriani do you plan on running the sync mentioned above to make sure my config change syncs everywhere? [21:28:59] thcipriani: LGTM [21:29:08] stephanebisson: sure, this will go live with koi's change [21:29:31] thcipriani thanks [21:29:58] brennen: dancy: thcipriani: stephanebisson: fyi: i phabricatorized the issue as https://phabricator.wikimedia.org/T311788. [21:30:08] thx [21:31:19] not sure if we should send sth like "sync everything twice" to ops-l, or if fixing it would be quick. 
[21:31:49] dancy: FWIW it's restarting 307 + 9 hosts which I note is different than the 348 + 9 it syncs to :) [21:31:51] (03CR) 10Andrew Bogott: [C: 03+2] wmcs-makedomain: forward to python3 [puppet] - 10https://gerrit.wikimedia.org/r/810103 (owner: 10Andrew Bogott) [21:32:47] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:32:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:33:50] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:33:51] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:33:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:33:55] !log thcipriani@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:810060|tawikisource: Add English alias for Author/Author_talk namespace (T165813)]] (duration: 03m 42s) [21:33:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:34:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:34:04] T165813: Create Author: namespace on Tamil wikisource - https://phabricator.wikimedia.org/T165813 [21:34:15] ^ koi and stephanebisson should be live now [21:34:16] back [21:34:24] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [21:34:51] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:34:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:35:08] thcipriani looks good, thanks again [21:35:18] thcipriani: There were no complaints from the restart script during the deployment?
[21:35:37] stephanebisson: that's good, yw [21:35:40] dancy: nope [21:35:41] thcipriani: thanks, one more thing, would you like to run namespaceDupes.php at tawikisource as mentioned in T165813 [21:35:45] gah [21:35:50] koi: sure [21:36:51] koi: blerg, looks like there are a few manual fixes needed here [21:37:34] koi: https://phabricator.wikimedia.org/P30697 [21:37:47] I'd like a copy of the deployment transcript [21:37:58] thcipriani: can you run it with something like --add-prefix=BROKEN (so it can be resolved on wiki)? :) [21:38:02] (03PS1) 10Andrea Denisse: Add PHP 7.4 dependencies for LibreNMS [puppet] - 10https://gerrit.wikimedia.org/r/810106 [21:38:40] (03PS1) 10Andrew Bogott: Change formatting of a few openstack calls [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810107 [21:38:44] urbanecm: that one is news to me! Will it only do it with conflicts? (/me hopes) [21:38:55] yeah. it's like a backup plan :) [21:39:01] thcipriani: i'll follow up with Clare regarding my patches. Sorry to add to the drama today! 
[21:39:09] Jdlrobson: <3 [21:39:12] (03PS7) 10Andrew Bogott: wmcs: vps: create_instance_with_prefix: unbreak [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/802170 (owner: 10Majavah) [21:39:29] urbanecm: TIL, I'll do that [21:39:35] 👍 [21:39:54] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:39:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:40:43] (03PS2) 10Andrea Denisse: Add PHP 7.4 dependencies for LibreNMS [puppet] - 10https://gerrit.wikimedia.org/r/810106 [21:40:45] (03CR) 10Andrew Bogott: [C: 03+2] striker: connect docker container directly to host network [puppet] - 10https://gerrit.wikimedia.org/r/809714 (https://phabricator.wikimedia.org/T306469) (owner: 10BryanDavis) [21:40:47] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:40:48] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:40:48] (03CR) 10CI reject: [V: 04-1] Add PHP 7.4 dependencies for LibreNMS [puppet] - 10https://gerrit.wikimedia.org/r/810106 (owner: 10Andrea Denisse) [21:40:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:40:52] thanks urbanecm -- koi all done, I'll get a paste on that ticket [21:40:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:41:48] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:41:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:42:14] (03PS2) 10Thcipriani: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810077 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [21:42:20] thanks a lot! 
[21:42:25] (03CR) 10Thcipriani: [C: 03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810077 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [21:43:34] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810077 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [21:43:38] (03CR) 10CI reject: [V: 04-1] Add PHP 7.4 dependencies for LibreNMS [puppet] - 10https://gerrit.wikimedia.org/r/810106 (owner: 10Andrea Denisse) [21:44:06] (03CR) 10CI reject: [V: 04-1] wmcs: vps: create_instance_with_prefix: unbreak [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/802170 (owner: 10Majavah) [21:44:11] (03CR) 10CI reject: [V: 04-1] Change formatting of a few openstack calls [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810107 (owner: 10Andrew Bogott) [21:45:33] dancy: here was the output of the deploy that evidently didn't go everywhere: https://phabricator.wikimedia.org/P30698 [21:46:29] thx [21:46:52] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:46:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:47:05] jan_drewniak: still using portals/sync-portals, correct? 
[21:47:30] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:47:54] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:47:55] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:47:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:48:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:48:30] thcipriani: yeah thanks [21:48:35] jan_drewniak: it's live on mwdebug1002 if there are things you need to check? [21:48:53] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:48:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:48:56] (03PS5) 10Thcipriani: [wmf-config]: Deploy GDI Survey 2 on EN and FA wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810063 (https://phabricator.wikimedia.org/T311759) (owner: 10Eigyan) [21:48:58] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:49:28] (03CR) 10Thcipriani: [C: 03+2] [wmf-config]: Deploy GDI Survey 2 on EN and FA wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810063 (https://phabricator.wikimedia.org/T311759) (owner: 10Eigyan) [21:49:33] thcipriani: ok looks good [21:49:44] jan_drewniak: cool, thanks for checking, running sync-portals now [21:50:18] (03Merged) 10jenkins-bot: [wmf-config]: Deploy GDI Survey 2 on EN and FA wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810063 (https://phabricator.wikimedia.org/T311759) (owner: 10Eigyan) [21:52:15] (03PS1) 10Jdlrobson: Enable grid on beta cluster [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/810109 (https://phabricator.wikimedia.org/T303484) [21:53:16] !log thcipriani@deploy1002 Synchronized portals/wikipedia.org/assets: Config: [[gerrit:810077|Bumping portals to master (T128546)]] (duration: 03m 24s) [21:53:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:53:28] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [21:53:36] (03PS4) 10Jdlrobson: Vector: Deploy title above tabs to all opt-in wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810099 (https://phabricator.wikimedia.org/T310054) [21:53:56] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:53:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:54:56] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:54:58] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:55:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:55:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:55:51] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:55:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:56:51] !log thcipriani@deploy1002 Synchronized portals: Config: [[gerrit:810077|Bumping portals to master (T128546)]] (duration: 03m 34s) [21:56:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:56:59] ^ jan_drewniak all done! [21:57:06] eigyan: still around? [21:57:15] hes inideed! 
[21:57:18] yes [21:57:40] sorry got so excited to type...lol [21:58:00] :D [21:58:22] eigyan: your change is live on mwdebug1002, check please [21:58:29] Thank you thcipriani I will check [21:58:54] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [22:00:49] thcipriani surveys are working as expected. Thank you so much for enduring this late deploy for us all! [22:01:07] eigyan: glad to hear it, going live [22:01:13] (03PS5) 10Dzahn: mediawiki: redirect policy and related sites to wikimediafoundation.org [puppet] - 10https://gerrit.wikimedia.org/r/809324 (https://phabricator.wikimedia.org/T310738) [22:01:14] Excellent! [22:01:29] (03CR) 10Dzahn: mediawiki: redirect policy and related sites to wikimediafoundation.org (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/809324 (https://phabricator.wikimedia.org/T310738) (owner: 10Dzahn) [22:01:37] (03CR) 10Dzahn: "fixed yaml trap" [puppet] - 10https://gerrit.wikimedia.org/r/809324 (https://phabricator.wikimedia.org/T310738) (owner: 10Dzahn) [22:03:34] (03CR) 10RLazarus: [C: 03+1] mediawiki: redirect policy and related sites to wikimediafoundation.org [puppet] - 10https://gerrit.wikimedia.org/r/809324 (https://phabricator.wikimedia.org/T310738) (owner: 10Dzahn) [22:05:15] 10SRE, 10Znuny, 10serviceops, 10serviceops-collab, 10Sustainability (Incident Followup): enhance Znuny (otrs) alerting - https://phabricator.wikimedia.org/T303190 (10Dzahn) [22:06:25] (03PS2) 10Dzahn: vtrs: add promtheus blackbox monitoring [puppet] - 10https://gerrit.wikimedia.org/r/810087 (https://phabricator.wikimedia.org/T303190) [22:06:33] !log thcipriani@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:810063|[wmf-config]: Deploy GDI Survey 2 on EN and FA wikis (T311759)]] (duration: 03m 16s) [22:06:38] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:06:40] T311759: Deploy GDI Safety Survey Wave 2 on EN and FA wikis - https://phabricator.wikimedia.org/T311759 [22:06:51] ^ eigyan should be live now! [22:06:54] kudos [22:07:11] Awesome many thanks to you thcipriani! [22:08:01] (03CR) 10Clare Ming: [C: 03+2] Enable grid on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810109 (https://phabricator.wikimedia.org/T303484) (owner: 10Jdlrobson) [22:08:27] fyi - just doing a few more backports before closing this window [22:08:50] (03Merged) 10jenkins-bot: Enable grid on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810109 (https://phabricator.wikimedia.org/T303484) (owner: 10Jdlrobson) [22:10:01] 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops, 10serviceops-collab, and 2 others: replace doc1001.eqiad.wmnet with a buster VM and create the codfw equivalent - https://phabricator.wikimedia.org/T247653 (10Dzahn) [22:10:21] (03CR) 10Clare Ming: [C: 03+2] Vector: Deploy title above tabs to all opt-in wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810099 (https://phabricator.wikimedia.org/T310054) (owner: 10Jdlrobson) [22:11:18] (03Merged) 10jenkins-bot: Vector: Deploy title above tabs to all opt-in wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810099 (https://phabricator.wikimedia.org/T310054) (owner: 10Jdlrobson) [22:11:20] 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops, 10serviceops-collab, and 2 others: replace doc1001.eqiad.wmnet with a buster VM and create the codfw equivalent - https://phabricator.wikimedia.org/T247653 (10Dzahn) @Krinkle and I agreed on doing this tomorrow at 14:00 PST [22:11:45] 10SRE, 10Epic: Migrate all of production metal and VMs to Buster or later - https://phabricator.wikimedia.org/T247045 (10Dzahn) [22:11:47] 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops, 10serviceops-collab, and 2 others: replace doc1001.eqiad.wmnet with a buster 
VM and create the codfw equivalent - https://phabricator.wikimedia.org/T247653 (10Dzahn) 05Open→03In progress [22:13:27] Jdlrobson: config change is up on mwdebug1002 if you want to verify [22:13:52] !log cjming@deploy1002 Synchronized wmf-config/InitialiseSettings-labs.php: Config: [[gerrit:810109|Enable grid on beta cluster (T303484)]] (duration: 03m 43s) [22:13:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:13:59] T303484: Introduce basic grid system to modern Vector - https://phabricator.wikimedia.org/T303484 [22:14:36] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [22:14:47] cjming: can you do 809961? if not no problem [22:15:14] RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [22:15:29] danisztls: can you add to deployment cal? i can do it - that's just enabling the survey right? [22:16:01] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [22:16:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:16:42] cjming: done, thanks! [22:16:50] yes, just enabling it [22:17:04] np [22:17:04] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [22:17:05] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [22:17:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:17:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:19:02] sync away! [22:19:08] syncing! 
[22:19:43] (03PS5) 10Clare Ming: QuickSurveys: Enable 'research-incentive' survey on 'jawiki' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/809961 (https://phabricator.wikimedia.org/T311015) (owner: 10DDesouza) [22:20:49] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [22:20:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:21:53] (03CR) 10Clare Ming: [C: 03+2] QuickSurveys: Enable 'research-incentive' survey on 'jawiki' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/809961 (https://phabricator.wikimedia.org/T311015) (owner: 10DDesouza) [22:22:40] (03Merged) 10jenkins-bot: QuickSurveys: Enable 'research-incentive' survey on 'jawiki' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/809961 (https://phabricator.wikimedia.org/T311015) (owner: 10DDesouza) [22:23:03] !log cjming@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:810099|Vector: Deploy title above tabs to all opt-in wikis (T310054)]] (duration: 03m 36s) [22:23:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:23:09] T310054: Deploy new toolbar order - https://phabricator.wikimedia.org/T310054 [22:23:15] Jdlrobson: ^^ live [22:24:12] danisztls: can you see survey on mwdebug1002? 
[22:24:42] (03CR) 10Volans: "reply inline" [puppet] - 10https://gerrit.wikimedia.org/r/808984 (owner: 10Jbond) [22:24:43] cjming: yes, lgtm [22:24:50] cool - going live then [22:25:52] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [22:25:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:26:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [22:26:45] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [22:26:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:26:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:27:43] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [22:27:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:28:26] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: An error occurred checking if Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [22:28:45] !log cjming@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:809961|QuickSurveys: Enable 'research-incentive' survey on 'jawiki' (T311015)]] (duration: 03m 40s) [22:28:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:28:51] T311015: Deploy QuickSurvey on Japanese Wikipedia - https://phabricator.wikimedia.org/T311015 [22:28:56] (03PS1) 10BryanDavis: striker: Bump container version to 2022-06-29-004157-production [puppet] - 10https://gerrit.wikimedia.org/r/810118 [22:29:06] danisztls: survey should be live [22:30:52] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [22:30:52] does it take a while to sync? 
[22:30:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:31:18] danisztls: it's done syncing - should be live now [22:32:03] * dancy eyes [22:34:34] cjming: it's showing the survey on mwdebug but not on prod [22:34:45] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:34:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:35:06] * dancy shakes a fist [22:36:21] well... perhaps I should reserve fist shaking until it's confirmed that this is the same problem. :-) [22:36:35] it's showing now [22:36:38] whew! [22:36:39] oh good [22:36:43] * dancy unshakes fist [22:36:45] lol [22:37:02] I'm curious about the cause now [22:37:14] Thanks cjming [22:37:17] np! [22:37:24] closing the window at long last [22:39:00] !log end of UTC late backport and config training window [22:39:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:47:22] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [22:53:11] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db2161.mgmt.codfw.wmnet with reboot policy FORCED [22:53:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:54:53] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db2162.mgmt.codfw.wmnet with reboot policy FORCED [22:54:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:03:59] (03CR) 10Andrew Bogott: [C: 03+2] striker: Bump container version to 2022-06-29-004157-production [puppet] - 10https://gerrit.wikimedia.org/r/810118 (owner: 10BryanDavis) [23:39:13] (KubernetesRsyslogDown) firing: rsyslog on kubemaster2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown 
[23:39:43] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2162.mgmt.codfw.wmnet with reboot policy FORCED [23:39:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:42:03] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2161.mgmt.codfw.wmnet with reboot policy FORCED [23:42:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:42:28] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 80, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:42:31] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db2163.mgmt.codfw.wmnet with reboot policy FORCED [23:42:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:43:00] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db2164.mgmt.codfw.wmnet with reboot policy FORCED [23:43:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:43:16] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 235, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:46:25] (03PS6) 10Krinkle: mediawiki: redirect policy and related sites to wikimediafoundation.org [puppet] - 10https://gerrit.wikimedia.org/r/809324 (https://phabricator.wikimedia.org/T310738) (owner: 10Dzahn) [23:48:54] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2164.mgmt.codfw.wmnet with reboot policy FORCED [23:48:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:54:49] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db2164.mgmt.codfw.wmnet with reboot policy FORCED [23:54:54] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:57:12] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 81, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:57:43] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2164.mgmt.codfw.wmnet with reboot policy FORCED [23:57:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:57:58] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 236, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down